ACQL: AN ADAPTIVE CONSERVATIVE Q-LEARNING FRAMEWORK FOR OFFLINE REINFORCEMENT LEARNING

Abstract

Offline Reinforcement Learning (RL), which relies only on static datasets without additional interaction with the environment, provides an appealing way to learn a safe and promising control policy. Most existing offline RL methods do not consider relative data quality and only crudely constrain the distribution gap between the learned policy and the behavior policy. Moreover, these algorithms cannot adaptively control the conservative level at a finer granularity, e.g., per state-action pair, which leads to performance drops, especially on highly diversified datasets. In this paper, we propose an Adaptive Conservative Q-Learning (ACQL) framework that enables more flexible control over the conservative level of the Q-function for offline RL. Specifically, we present two adaptive weight functions that shape the Q-values of in-distribution and out-of-distribution data. We then discuss the conditions under which the conservative level of the learned Q-function changes, and define the monotonicity of the weight functions with respect to data quality and similarity. Motivated by this theoretical analysis, we propose a novel algorithm within the ACQL framework that instantiates the adaptive weight functions as neural networks. To learn proper adaptive weight functions, we design surrogate losses incorporating the conditions for adjusting conservative levels, together with a contrastive loss to maintain the monotonicity of the weight functions. We evaluate ACQL on the commonly used D4RL benchmark and conduct extensive ablation studies, demonstrating its effectiveness and state-of-the-art performance compared to existing offline RL baselines.

1. INTRODUCTION

With the help of deep learning, Reinforcement Learning (RL) has achieved remarkable results on a variety of previously intractable problems, such as playing video games (Silver et al., 2016), controlling robots (Kalashnikov et al., 2018; Akkaya et al., 2019) and driving autonomous cars (Yu et al., 2020a; Zhao et al., 2022). However, the prerequisite that the agent has to interact with the environment makes the learning process costly and unsafe in many real-world scenarios. Recently, offline RL (Lange et al., 2012; Prudencio et al., 2022) has been proposed as a promising alternative that relaxes this requirement. In offline RL, the agent directly learns a control policy from a given static dataset, previously collected by an unknown behavior policy. Offline RL enables the agent to achieve comparable or even better performance without additional interaction with the environment. Unfortunately, once interaction is stripped away, offline RL becomes very challenging due to the distribution shift between the behavior policy and the learned policy. This shift often leads to overestimated values for out-of-distribution (OOD) actions (Kumar et al., 2019; Levine et al., 2020) and thus misleads the policy into choosing these erroneously estimated actions. To alleviate the distribution shift problem, recent methods (Kumar et al., 2019; Jaques et al., 2019; Wu et al., 2019; Siegel et al., 2020) constrain the learned policy toward the behavior policy in different ways, such as limiting the action space (Fujimoto et al., 2019), using KL divergence (Wu et al., 2019), or using Maximum Mean Discrepancy (MMD) (Kumar et al., 2019). Besides directly constraining the policy, other methods (Kumar et al., 2020; Yu et al., 2021a; b; Ma et al., 2021) learn a conservative Q-function that constrains the policy implicitly and thus alleviates the overestimation problem.
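To make the conservative Q-learning idea above concrete, the sketch below shows the core shape of a CQL-style regularizer: push down Q-values of out-of-distribution actions while pushing up Q-values of dataset actions, with a single scalar alpha setting the conservative level. This is a toy numpy illustration of the general mechanism, not the exact loss of any cited method.

```python
import numpy as np

def cql_penalty(q_ood, q_data, alpha):
    """CQL-style conservative regularizer (toy sketch).

    q_ood:  Q-values of actions sampled outside the dataset
    q_data: Q-values of actions observed in the dataset
    alpha:  scalar conservative level; larger -> more conservative

    Minimizing this term pushes Q down on OOD actions and
    up on in-distribution actions.
    """
    return alpha * (float(np.mean(q_ood)) - float(np.mean(q_data)))
```

Note that alpha is applied uniformly to every transition; this uniformity is exactly what the adaptive weights discussed next are meant to relax.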
However, most previous methods treat all transition samples equally rather than adapting to each, which can be over-conservative, especially on non-expert datasets with high data diversity. As shown in Figure 1, a more conservative CQL (Kumar et al., 2020) agent (with a larger alpha) achieves higher returns on the expert dataset while suffering performance degradation on the random dataset, indicating that a higher conservative level works better for high-quality data and vice versa. This clearly shows the importance of the conservative level for the final results. It is therefore more appropriate to use adaptive weights for different transition samples to control the conservative level, e.g., raising the Q-values more for good actions and less for bad ones. In this paper, we focus on constraining the Q-function in a more flexible way and propose a general Adaptive Conservative Q-Learning (ACQL) framework, which sheds light on how to design a proper conservative Q-function. To achieve more fine-grained control over the conservative level, we use two adaptive weight functions to estimate the conservative weights for each transition sample. In the proposed framework, the form of the adaptive weight functions is not fixed, and particular forms can be defined according to practical needs. We theoretically analyze in detail the correlation between different conservative levels and the corresponding conditions that the weight functions must satisfy. We also formally define the monotonicity of the weight functions, capturing the property that they should raise the Q-values more for good actions and less for bad actions. Guided by these theoretical conditions, we propose a practical algorithm with learnable neural networks as adaptive weight functions. Overall, ACQL consists of three components.
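The per-sample control described above can be sketched by replacing CQL's single scalar alpha with two weight vectors, one for the out-of-distribution term and one for the in-distribution term, so each transition receives its own conservative level. The function and argument names below are illustrative assumptions, not the paper's exact notation.

```python
import numpy as np

def acql_penalty(q_ood, q_data, w_ood, w_in):
    """ACQL-style regularizer sketch with per-sample weights.

    w_ood: adaptive weight on each OOD Q-value (how hard to push down)
    w_in:  adaptive weight on each dataset Q-value (how hard to push up)

    Setting w_ood and w_in to the same constant recovers a
    CQL-style penalty with a single scalar conservative level.
    """
    push_down = float(np.mean(np.asarray(w_ood) * np.asarray(q_ood)))
    push_up = float(np.mean(np.asarray(w_in) * np.asarray(q_data)))
    return push_down - push_up
```

In the paper these weights are produced by learned networks of the state-action pair; here they are plain arrays for clarity.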
First, we preprocess the fixed dataset to compute transition quality measurements, which serve as pseudo labels for data quality and similarity. Then, using these measurements, we construct surrogate losses to keep the conservative level of ACQL between that of the true Q-function and that of CQL. We also add a contrastive loss to maintain the monotonicity of the adaptive weight functions with respect to data quality and similarity. Finally, we train the adaptive weight functions, the actor network, and the critic network alternately. We summarize our contributions as follows: 1) We propose ACQL, a more flexible framework that supports fine-grained control of the conservative level in offline RL. 2) We theoretically analyze how the conservative level changes under different forms of adaptive weight functions. 3) Guided by the proposed framework, we present a novel practical algorithm with carefully designed surrogate and contrastive losses to control the conservative levels and monotonicity. 4) We conduct extensive experiments on the D4RL benchmark, and the state-of-the-art results demonstrate the effectiveness of our framework.
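The contrastive monotonicity loss mentioned above can be sketched as a pairwise hinge: whenever sample i has higher quality than sample j, the weight function is penalized unless its weight for i exceeds its weight for j by some margin. The pairwise hinge form, the `margin` parameter, and all names here are illustrative assumptions rather than the paper's exact loss.

```python
import numpy as np

def monotonicity_loss(weights, quality, margin=0.1):
    """Pairwise hinge sketch of a monotonicity-enforcing loss.

    For every ordered pair (i, j) with quality[i] > quality[j],
    incur a penalty unless weights[i] >= weights[j] + margin.
    Returns the mean penalty over all such pairs.
    """
    loss, pairs = 0.0, 0
    n = len(weights)
    for i in range(n):
        for j in range(n):
            if quality[i] > quality[j]:
                # Hinge: zero once the weight gap respects the margin.
                loss += max(0.0, margin - (weights[i] - weights[j]))
                pairs += 1
    return loss / max(pairs, 1)
```

A weight function whose outputs are already ordered consistently with quality incurs zero loss; inverted orderings are penalized in proportion to the violation.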

2. RELATED WORK

Imitation Learning. To learn from a given static dataset, Imitation Learning (IL) is the most straightforward strategy. The core idea of IL is to mimic the behavior policy. As its simplest form, behavior cloning still holds a place in offline reinforcement learning, especially for expert datasets. However, expert datasets are available only in a minority of cases. Recently, some methods (Chen et al., 2020; Siegel et al., 2020; Wang et al., 2020; Liu et al., 2021) aim to filter out sub-optimal data and then apply the supervised learning paradigm afterward. Specifically, BAIL (Chen et al., 2020) performs imitation learning only on a high-quality subset of the dataset, selected with a learned value function. However, these methods often neglect the information contained in bad actions with lower returns and thus often fail on tasks with non-optimal datasets. We believe these methods are
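The filter-then-imitate recipe described above can be sketched as a simple return-based selection step followed by behavior cloning on the surviving pairs. A fixed quantile threshold is used here for illustration; BAIL itself selects data with a learned upper-envelope value function, so this is only a rough stand-in.

```python
import numpy as np

def filter_by_return(states, actions, returns, keep_frac=0.25):
    """Filter-then-imitate sketch (quantile stand-in for BAIL's
    learned selection): keep only the transitions whose observed
    return lies in the top `keep_frac` fraction. A behavior-cloning
    loss would then be trained on the returned (state, action) pairs.
    """
    returns = np.asarray(returns, dtype=float)
    threshold = np.quantile(returns, 1.0 - keep_frac)
    mask = returns >= threshold
    return np.asarray(states)[mask], np.asarray(actions)[mask]
```

As the surrounding text notes, any such filter discards the information carried by low-return transitions, which is precisely the weakness that motivates learning from the whole dataset with adaptive conservatism instead.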



Figure 1: Performance gaps of CQL with different conservative levels on HalfCheetah tasks.

