ACQL: AN ADAPTIVE CONSERVATIVE Q-LEARNING FRAMEWORK FOR OFFLINE REINFORCEMENT LEARNING

Abstract

Offline Reinforcement Learning (RL), which relies only on static datasets without additional interaction with the environment, provides an appealing alternative for learning a safe and effective control policy. Most existing offline RL methods do not consider relative data quality and only crudely constrain the distribution gap between the learned policy and the behavior policy. Moreover, these algorithms cannot adaptively control the conservative level at a finer granularity, e.g., for each state-action pair, leading to a performance drop, especially on highly diversified datasets. In this paper, we propose an Adaptive Conservative Q-Learning (ACQL) framework that enables more flexible control over the conservative level of the Q-function for offline RL. Specifically, we present two adaptive weight functions that shape the Q-values of in-dataset and out-of-distribution data. We then discuss the conditions under which the conservative level of the learned Q-function changes, and define the monotonicity of the weight functions with respect to data quality and similarity. Motivated by this theoretical analysis, we propose a novel algorithm within the ACQL framework that uses neural networks as the adaptive weight functions. To learn proper weight functions, we design surrogate losses incorporating the conditions for adjusting conservative levels and a contrastive loss to maintain the monotonicity of the adaptive weight functions. We evaluate ACQL on the commonly used D4RL benchmark and conduct extensive ablation studies, demonstrating its effectiveness and state-of-the-art performance compared to existing offline RL baselines.

1. INTRODUCTION

With the help of deep learning, Reinforcement Learning (RL) has achieved remarkable results on a variety of previously intractable problems, such as playing video games (Silver et al., 2016), controlling robots (Kalashnikov et al., 2018; Akkaya et al., 2019) and driving autonomous cars (Yu et al., 2020a; Zhao et al., 2022). However, the prerequisite that the agent has to interact with the environment makes the learning process costly and unsafe for many real-world scenarios. Recently, offline RL (Lange et al., 2012; Prudencio et al., 2022) has been proposed as a promising alternative that relaxes this requirement. In offline RL, the agent directly learns a control policy from a given static dataset, previously collected by an unknown behavior policy. Offline RL enables the agent to achieve comparable or even better performance without additional interaction with the environment. Unfortunately, stripped of online interaction, offline RL is very challenging due to the distribution shift between the behavior policy and the learned policy. This shift often leads to overestimated values for out-of-distribution (OOD) actions (Kumar et al., 2019; Levine et al., 2020) and thus misleads the policy into choosing these erroneously estimated actions. To alleviate the distribution-shift problem, recent methods (Kumar et al., 2019; Jaques et al., 2019; Wu et al., 2019; Siegel et al., 2020) constrain the learned policy toward the behavior policy in different ways, such as limiting the action space (Fujimoto et al., 2019), using KL divergence (Wu et al., 2019), or using Maximum Mean Discrepancy (MMD) (Kumar et al., 2019). Besides directly constraining the policy, other methods (Kumar et al., 2020; Yu et al., 2021a;b; Ma et al., 2021) learn a conservative Q-function that constrains the policy implicitly and thus alleviates the overestimation problem.
However, most previous methods optimize all transition samples equally rather than adapting selectively, which may be over-conservative, especially for non-expert datasets with high data diversity. As shown in Figure 1, a more conservative CQL agent (with a larger α) (Kumar et al., 2020) achieves higher returns on the expert dataset while suffering performance degradation on the random dataset, indicating that a higher conservative level works better for high-quality data and vice versa. This clearly shows the influence of the conservative level on the final results. Therefore, it is more appropriate to use adaptive weights for different transition samples to control the conservative level, e.g., raising the Q-values more for good actions and less for bad ones. In this paper, we focus on constraining the Q-function in a more flexible way and propose a general Adaptive Conservative Q-Learning (ACQL) framework, which sheds light on how to design a proper conservative Q-function. To achieve more fine-grained control over the conservative level, we use two adaptive weight functions to estimate the conservative weights for each transition sample. In the proposed framework, the form of the adaptive weight functions is not fixed, and particular forms can be defined according to practical needs. We theoretically discuss in detail the correlation between different conservative levels and the corresponding conditions that the weight functions need to satisfy. We also formally define the monotonicity of the weight functions to capture the property that they should raise the Q-values more for good actions and less for bad ones. Guided by these theoretical conditions, we propose a practical algorithm with learnable neural networks as the adaptive weight functions. Overall, ACQL consists of three components.
Firstly, we preprocess the fixed dataset to calculate transition quality measurements, which serve as pseudo labels for data quality and similarity. Then, with the help of these measurements, we construct surrogate losses to keep the conservative level of ACQL between the true Q-function and CQL. We also add a contrastive loss to maintain the monotonicity of the adaptive weight functions with respect to data quality and similarity. Lastly, we train the adaptive weight functions, the actor network and the critic network alternately. We summarize our contributions as follows: 1) We propose a more flexible framework, ACQL, that supports fine-grained control of the conservative level in offline RL. 2) We theoretically analyze how the conservative level changes conditioned on different forms of the adaptive weight functions. 3) Under the guidance of the proposed framework, we present a novel practical algorithm with carefully designed surrogate and contrastive losses to control the conservative level and monotonicity. 4) We conduct extensive experiments on the D4RL benchmark, and the state-of-the-art results demonstrate the effectiveness of our framework.

2. RELATED WORK

Imitation Learning. To learn from a given static dataset, Imitation Learning (IL) is the most straightforward strategy. The core idea of IL is to mimic the behavior policy. As the simplest form, behavior cloning still holds a place in offline reinforcement learning, especially for expert datasets. However, expert datasets are available only in a minority of cases. Recently, some methods (Chen et al., 2020; Siegel et al., 2020; Wang et al., 2020; Liu et al., 2021) aim to filter out sub-optimal data and then apply the supervised learning paradigm. Specifically, BAIL (Chen et al., 2020) performs imitation learning only on a high-quality subset of the dataset purified by a learned value function. However, these methods often neglect the information contained in bad actions with lower returns and thus often fail on tasks with non-optimal datasets. We believe these methods are complementary to our framework ACQL, since they split the data into regions where different conservative levels can be set.

Model-free Offline RL.

A large number of model-free offline RL methods aim to maximize returns while constraining the learned policy to stay close to the behavior policy. There are various ways to constrain the policy directly, including minimizing the KL-divergence (Jaques et al., 2019; Wu et al., 2019; Zhou et al., 2020), MMD (Kumar et al., 2019), or Wasserstein distance (Wu et al., 2019), and adding a behavior cloning regularizer (Fujimoto & Gu, 2021). The policy can also be constrained implicitly via action space reduction (Fujimoto et al., 2019), importance sampling based algorithms (Sutton et al., 2016; Nachum et al., 2019), implicit forms of the KL-divergence (Nair et al., 2020; Peng et al., 2019; Simão et al., 2020), uncertainty quantification (Agarwal et al., 2020; Kumar et al., 2019), or a conservative Q-function (Kumar et al., 2020; Ma et al., 2021; Sinha et al., 2022). More recently, Onestep RL (Brandfonbrener et al., 2021) and IQL (Kostrikov et al., 2021b) proposed to improve the policy after the convergence of the Q-function. Trajectory Transformer (TT) (Janner et al., 2021) and Decision Transformer (DT) (Chen et al., 2021) leverage the Transformer architecture (Vaswani et al., 2017) to optimize over trajectories. In this work, we propose a flexible framework, ACQL, for model-free methods that constrain the Q-function. ACQL supports defining different conservative levels of the Q-function for each state-action pair, with CQL (Kumar et al., 2020) being the special case where all levels are equal.

Model-based Offline RL. Recently, model-based methods (Janner et al., 2019; Kidambi et al., 2020; Yu et al., 2020b; 2021b; Matsushima et al., 2020) have attracted much attention in offline RL. They first learn the transition dynamics and reward function as a proxy environment, which can subsequently be used for policy search.
Given the proxy environment, offline methods (Ross & Bagnell, 2012; Kidambi et al., 2020) can be used for control directly, or planning and trajectory optimization methods like LQR (Tassa et al., 2012) and MCTS (Browne et al., 2012) can be run. Although model-based offline RL can be highly sample-efficient, its direct use can be challenging due to the distribution-shift issue. In this paper, we focus only on model-free offline RL.

3. PROBLEM STATEMENT

We consider the environment as a fully-observed Markov Decision Process (MDP), represented by a tuple (S, A, P, r, ρ_0, γ): the state space S, the action space A, the transition probability function P : S × A × S → [0, 1], the reward function r : S × A × S → R, the initial state distribution ρ_0(s), and the discount factor γ ∈ (0, 1). The goal is to learn a control policy π(a|s) that maximizes the cumulative discounted return

G_t = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t, s_{t+1}) \,\Big|\, s_0 \sim \rho_0,\; a_t \sim \pi(\cdot|s_t),\; s_{t+1} \sim P(\cdot|s_t, a_t)\Big].

In the Actor-Critic framework, the learning process repeatedly alternates between policy evaluation, which computes the value function of a policy, and policy improvement, which obtains a better policy from the value function. Given a replay buffer (dataset) D = {(s, a, r, s')} consisting of finite transition samples, policy evaluation is defined as follows:

\hat{Q}^{k+1} \leftarrow \arg\min_Q \; \mathbb{E}_{s,a,s' \sim D}\big[(\mathcal{B}^\pi \hat{Q}^k(s,a) - Q(s,a))^2\big],  (1)

where k is the iteration number and the Bellman operator is defined as \mathcal{B}^\pi \hat{Q}^k(s,a) = r(s,a) + \gamma \mathbb{E}_{a' \sim \pi^k(a'|s')}[\hat{Q}^k(s',a')]. Note that the empirical Bellman operator \hat{\mathcal{B}}^\pi, which backs up only a single transition, is used in practice, because it is difficult to cover all possible transitions (s, a, s') in D, especially for continuous action spaces. After approximating the Q-function, policy improvement is performed as follows:

\hat{\pi}^{k+1} \leftarrow \arg\max_\pi \; \mathbb{E}_{s \sim D, a \sim \pi^k(a|s)}\big[\hat{Q}^{k+1}(s,a)\big].  (2)

Compared to online RL, offline RL only allows learning from a fixed dataset D collected by an unknown behavior policy π_β, while prohibiting additional interaction with the environment. One of the core issues in offline RL is the action distribution shift during training (Kumar et al., 2019; Wu et al., 2019; Jaques et al., 2019; Levine et al., 2020).
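As a concrete illustration of the policy evaluation step above, the following tabular sketch (our own toy example, not the paper's function-approximation setup) performs one pass of empirical Bellman backups, where each transition backs up a single sampled next action:

```python
import numpy as np

# Tabular sketch of the empirical Bellman backup in Eqs. (1)-(2): each
# transition backs up one sampled next action a' ~ pi(.|s'), and the
# Q-table takes a gradient step on the squared Bellman error.

def empirical_bellman_target(r, s_next, a_next, Q, gamma=0.99):
    """Empirical target r + gamma * Q(s', a') from one sampled a'."""
    return r + gamma * Q[s_next, a_next]

def policy_evaluation_step(Q, batch, gamma=0.99, lr=0.5):
    """One pass of gradient updates on 0.5 * (target - Q(s, a))^2."""
    Q = Q.copy()
    for (s, a, r, s_next, a_next) in batch:
        target = empirical_bellman_target(r, s_next, a_next, Q, gamma)
        Q[s, a] += lr * (target - Q[s, a])  # negative gradient direction
    return Q

Q = np.zeros((2, 2))                        # 2 states x 2 actions
batch = [(0, 1, 1.0, 1, 0), (1, 0, 0.0, 0, 1)]
Q = policy_evaluation_step(Q, batch)
```

Iterating this step with samples drawn from the fixed dataset D is exactly the setting where OOD actions a' can receive erroneously high values, motivating the conservative objectives that follow.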

4. ADAPTIVE CONSERVATIVE Q-LEARNING FRAMEWORK (ACQL)

In this section, we propose a general framework, Adaptive Conservative Q-Learning (ACQL), which enables more flexible control over the conservative level of the Q-function compared to other Q-function-constrained algorithms (Kumar et al., 2020; Ma et al., 2021; Yu et al., 2021a;b). Without loss of generality, the dataset collected by the behavior policy usually contains data with both high and low returns, even when the behavior policy is a random policy. At the same time, among the actions sampled from a particular distribution µ, there are also (relatively) good and bad actions rather than all actions having the same returns. In that case, we need a more flexible and fine-grained method to constrain the Q-function for each state-action pair. Toward this goal, we propose to use two adaptive weight functions d_µ(s, a) and d_{π_β}(s, a) to control the conservative level over the distribution µ and the empirical behavior policy \hat{π}_β, respectively. The family of optimization problems of our framework ACQL is:

\min_Q \max_\mu \; \mathbb{E}_{s \sim D, a \sim \mu(a|s)}[d_\mu(s,a) \cdot Q(s,a)] - \mathbb{E}_{s \sim D, a \sim \hat{\pi}_\beta(a|s)}[d_{\pi_\beta}(s,a) \cdot Q(s,a)] + \frac{1}{2}\,\mathbb{E}_{s,a,s' \sim D}\big[(Q(s,a) - \hat{\mathcal{B}}^\pi \hat{Q}^k(s,a))^2\big] + \mathcal{R}(\mu).  (3)

Note that the form of the adaptive weight functions d_µ(s, a) and d_{π_β}(s, a) is not fixed and can be customized for different situations. It is their arbitrary form that allows us to shape the Q-function more finely. In the following, we discuss the conditions under which we can adjust the conservative level of ACQL relative to the true Q-function and to CQL (Kumar et al., 2020), and the properties that the adaptive weight functions d_µ(s, a) and d_{π_β}(s, a) should have. First, we list the conditions under which ACQL is more conservative than the true Q-function in Proposition 4.1. Proposition 4.1 (The conservative level of ACQL).
For any µ with supp µ ⊂ supp \hat{π}_β, without considering the sampling error between the empirical backup \hat{\mathcal{B}}^\pi Q and the true Bellman backup \mathcal{B}^\pi Q, the conservative level of ACQL can be controlled over the Q-values. The learned Q-function \hat{Q}^\pi is more conservative than the true Q-function Q^\pi point-wise if:

\forall s \in D, a: \quad \frac{d_\mu \cdot \mu - d_{\pi_\beta} \cdot \pi_\beta}{\pi_\beta} \ge 0.  (4)

Moreover, as shown in Appendix A, we can also control the conservative level over other regions, such as the V-values or the empirical MDP, by changing the state-action space in Equation (4). As a special instance of our framework, CQL (Kumar et al., 2020) constrains the conservative level over the expected V-values, and it performs the optimization for all state-action pairs with the same weight α, which may be too rigid and over-conservative in some scenarios. If we replace "≥" with "≤" in Equation (4), we obtain the conditions under which ACQL is less conservative than the true Q-function. Following (Auer et al., 2008; Osband et al., 2016; Kumar et al., 2020), we also show that ACQL bounds the gap between the learned Q-values and the true Q-values when the sampling error between \hat{\mathcal{B}}^\pi Q and \mathcal{B}^\pi Q is taken into account. Due to space limitations, we present the proposition and proof in Appendix A. Besides the comparison to the true Q-function, we also give a theoretical comparison to CQL (Kumar et al., 2020). Proposition 4.2 (The conservative level compared to CQL). For any µ with supp µ ⊂ supp \hat{π}_β, given that the Q-function learned by CQL is \hat{Q}^\pi_{CQL}(s,a) = Q^\pi - \alpha \frac{\mu - \pi_\beta}{\pi_\beta}, similar to Proposition 4.1, the conservative level of ACQL compared to CQL can be controlled over the Q-values. The learned Q-function \hat{Q}^\pi is less conservative than the CQL Q-function \hat{Q}^\pi_{CQL} point-wise if:

\forall s \in D, a: \quad \frac{(\alpha - d_\mu)\,\mu - (\alpha - d_{\pi_\beta})\,\pi_\beta}{\pi_\beta} \ge 0.  (5)
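Proposition 4.1 can be checked numerically on a toy example (our own, with arbitrary weights and probabilities): ignoring sampling error, the fixed point of the ACQL update shifts the Bellman target down by exactly the quantity in Equation (4), so the learned Q sits below the target point-wise whenever that quantity is non-negative:

```python
import numpy as np

# Toy numeric check of Proposition 4.1: the ACQL fixed point is
# Q = B^pi Q_hat - (d_mu * mu - d_pib * pib) / pib (Appendix A), so a
# non-negative per-action gap gives point-wise conservatism.

mu    = np.array([0.7, 0.3])   # mu(a|s) over two actions at one state
pib   = np.array([0.5, 0.5])   # empirical behavior policy pi_beta(a|s)
d_mu  = np.array([1.0, 2.0])   # adaptive weight on the mu side
d_pib = np.array([0.5, 0.5])   # adaptive weight on the dataset side

gap = (d_mu * mu - d_pib * pib) / pib   # per-action conservatism gap, Eq. (4)
bellman_target = np.array([3.0, 4.0])   # stand-in for B^pi Q_hat(s, a)
q_acql = bellman_target - gap           # learned Q at the fixed point

is_conservative = bool(np.all(gap >= 0))
```

With these numbers the gap is positive for both actions, so `q_acql` lower-bounds the target everywhere; choosing smaller d_µ or larger d_{π_β} for a particular action shrinks its gap, which is precisely the fine-grained control ACQL exploits.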
Besides the discussion of the conservative level of ACQL, to capture the property that a good action should have a higher Q-value than a bad action, we formally define the monotonicity of the adaptive weight functions as follows. Definition 4.1 (The monotonicity of the adaptive weight functions). For any state s ∈ D, the adaptive weight functions are monotonic if a good action with a higher true Q-value has a lower d_µ value and a higher d_{π_β} value:

\forall s_i, s_j \in D,\; a_i, a_j \sim \mu(a|s): \quad d_\mu(s_i, a_i) - d_\mu(s_j, a_j) \propto Q^*(s_j, a_j) - Q^*(s_i, a_i),  (6)
\forall s_i, s_j \in D,\; a_i, a_j \sim \hat{\pi}_\beta(a|s): \quad d_{\pi_\beta}(s_i, a_i) - d_{\pi_\beta}(s_j, a_j) \propto Q^*(s_i, a_i) - Q^*(s_j, a_j),  (7)

where Q^* is the optimal Q-function. In Definition 4.1, we define monotonicity via the optimal Q-function, which is a natural ideal metric of action quality. However, this is ill-posed: if we knew the optimal Q-function, we could directly recover an optimal policy and solve the problem, and we also do not know what the exact proportionality should be. In Section 5, we therefore construct a contrastive loss using transition quality measurements to approximate this property for ACQL. All proofs are provided in Appendix A.

5. ACQL WITH LEARNABLE WEIGHT FUNCTIONS

In this section, derived from the theoretical discussion, we propose a practical ACQL algorithm, guided by the theoretical conditions, with learnable neural networks as the adaptive weight functions. To adaptively control the conservative level, ACQL proceeds in three steps. Firstly, we preprocess the fixed dataset to calculate transition quality measurements. Then, with the help of these measurements, we construct surrogate losses to control the conservative level of ACQL, and add contrastive losses to maintain monotonicity. Lastly, we train the adaptive weight functions, the actor network and the critic network alternately.
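The three steps can be organized as an alternating loop. The skeleton below is our own sketch of the control flow with stub update functions; the paper's actual optimizers, networks and update ratios are not reproduced here:

```python
# Skeleton (ours) of the three-phase ACQL procedure: preprocess quality
# measurements once, then alternate weight-function, critic, and actor
# updates each iteration. The update callables are placeholders.

def train_acql(dataset, n_iters, preprocess, update_weights,
               update_critic, update_actor):
    m = preprocess(dataset)            # step 1: transition quality measurements
    for _ in range(n_iters):
        update_weights(dataset, m)     # step 2: surrogate + contrastive losses
        update_critic(dataset)         # step 3a: conservative policy evaluation
        update_actor(dataset)          # step 3b: policy improvement

# Stub run recording the call order, to show the alternation:
calls = []
train_acql([], 2, lambda d: None,
           lambda d, m: calls.append("weights"),
           lambda d: calls.append("critic"),
           lambda d: calls.append("actor"))
```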

5.1. TRANSITION QUALITY MEASUREMENTS

To find a replacement for the optimal Q-function Q^* in Definition 4.1, we first preprocess the fixed dataset and calculate a transition quality measurement for each state-action pair. Since different tasks may have different magnitudes of Q-values, it is better to use a normalized measurement ranging in (0, 1). For actions in the fixed dataset, we define the Relative Transition Quality m(s, a) by combining the single-step reward and the discounted return of the whole trajectory:

\forall (s, a) \in D: \quad m(s, a) = \frac{1}{2}\big(r_{norm}(s, a) + g_{norm}(s, a)\big),  (8)

where r_{norm} = \frac{r_{cur} - r_{min}}{r_{max} - r_{min}} and g_{norm} = \frac{g_{cur} - g_{min}}{g_{max} - g_{min}}. Here r_{min}, r_{max}, g_{min}, g_{max} are the minimum and maximum values in the dataset of the single-step reward and of the Monte Carlo return of the whole trajectory, respectively. The range of m(s, a) is (0, 1), and a higher m(s, a) indicates that a is a better action. For OOD actions, which do not appear in the dataset, we cannot use single-step rewards and Monte Carlo returns; instead, we use both the quality of the corresponding in-dataset action m(s, a_{in}) and the Euclidean distance between the OOD action and the in-dataset action with the same state:

\forall (s, a_{in}) \in D,\; a_\mu \sim \mu(a|s): \quad m(s, a_\mu) = \frac{1}{2}\, m(s, a_{in}) + \frac{1}{2} \cdot \frac{1}{\|a_\mu - a_{in}\|_2 + 1}.  (9)

Consistent with m(s, a_{in}), the shift and scale in Equation (9) keep the range of m(s, a_µ) in (0, 1), and a higher m(s, a_µ) indicates that a_µ is a better action. Note that m(s, a_µ) is calculated over every training batch. We emphasize that Equations (8) and (9) are only a simple and effective replacement for the optimal Q-function and can serve as a baseline for future algorithms. Many methods, including the "upper envelope" in BAIL (Chen et al., 2020) and the uncertainty estimation in (Yu et al., 2020b), could be incorporated into ACQL.
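A minimal numpy sketch of this preprocessing (ours, not the paper's code; in particular, the exact shift-and-scale used for the OOD score in Equation (9) is our assumption):

```python
import numpy as np

# Sketch of the transition quality measurements in Sec. 5.1. m(s, a)
# averages min-max normalized per-step rewards and per-trajectory Monte
# Carlo returns (Eq. 8); for an OOD action we combine the paired
# in-dataset action's quality with a distance term so the score stays in
# (0, 1) (the precise Eq. 9 scaling here is our assumption).

def minmax(x):
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def relative_transition_quality(rewards, returns):
    """m(s, a) for every in-dataset transition; range (0, 1)."""
    return 0.5 * (minmax(rewards) + minmax(returns))

def ood_quality(m_in, a_ood, a_in):
    """m(s, a_mu) for an OOD action: closeness to a high-quality
    in-dataset action with the same state raises the score."""
    dist = np.linalg.norm(np.asarray(a_ood) - np.asarray(a_in))
    return 0.5 * m_in + 0.5 / (dist + 1.0)

m = relative_transition_quality([0.0, 1.0, 2.0], [10.0, 5.0, 20.0])
```

The in-dataset scores are computed once over the whole dataset, while the OOD scores are computed per training batch, matching the text above.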

5.2. OPTIMIZATION TO CONTROL THE CONSERVATIVE LEVEL

Suppose we represent the functions d_µ, d_{π_β} as deep neural networks; the key question is how to design the loss functions. Recalling Equations (4) and (5), we know the conditions for the different conservative levels and adapt them into loss functions. More specifically, we incorporate the conditions, which aim to learn a Q-function more conservative than the true Q-function but less conservative than CQL, into the following hinge losses:

L_{cl\_true}(d_\mu, d_{\pi_\beta}) = \max\big(0,\; d_{\pi_\beta} \cdot \pi_\beta - d_\mu \cdot \mu + C_1 \cdot \pi_\beta\big),  (10)
L_{cl\_cql}(d_\mu, d_{\pi_\beta}) = \max\big(0,\; (d_\mu - \alpha) \cdot \mu - (d_{\pi_\beta} - \alpha) \cdot \pi_\beta + C_2 \cdot \pi_\beta\big),  (11)

where C_1 and C_2 control the soft margin of the conservative level relative to the true Q-function and to CQL, respectively. A higher C_1 enforces a Q-function more conservative than the true Q-function, while a higher C_2 enforces a Q-function less conservative than CQL. However, C_1 and C_2 are difficult to tune and not adaptive enough as fixed hyperparameters. We therefore leverage the relative transition quality to compute C_1 and C_2 automatically, based on the property in Definition 4.1 that a good action should have a less conservative Q-value and thus a lower C_1 and a higher C_2:

C_1(s, a) = (1 - m(s, a)) \cdot r_{max},  (12)
C_2(s, a) = m(s, a) \cdot r_{max}.  (13)

The ranges of C_1 and C_2 are both (0, r_{max}). Since C_1 and C_2 are soft margins on the Q-values, we use the maximum single-step reward r_{max} as the maximum margin to avoid excessive fluctuations in policy evaluation. Nevertheless, when optimizing L_{cl\_true} and L_{cl\_cql}, we found that arithmetic underflow easily occurs: log µ and log π_β are often very small (e.g., −1000), so the resulting µ(a|s) and π_β(a|s) become 0 after exponentiation. To avoid this underflow problem, we use a necessary condition of Equations (4) and (5) to form surrogate losses. Lemma 5.1. For x > 0, ln x ≤ x − 1, with equality if and only if x = 1. The proof is provided in Appendix A.
Then the resulting surrogate losses are the following:

L_{cl\_true}(d_\mu, d_{\pi_\beta}) = \max\big(0,\; d_{\pi_\beta} \cdot (\ln \pi_\beta + 1) - d_\mu \cdot (\ln \mu + 1) + C_1 \cdot (\ln \pi_\beta + 1)\big),  (14)
L_{cl\_cql}(d_\mu, d_{\pi_\beta}) = \max\big(0,\; (d_\mu - \alpha)(\ln \mu + 1) - (d_{\pi_\beta} - \alpha)(\ln \pi_\beta + 1) + C_2(\ln \pi_\beta + 1)\big).  (15)
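For illustration, the per-sample surrogate losses can be written as follows (a scalar sketch of ours; in the actual algorithm d_µ, d_{π_β} are network outputs and the losses are averaged over a batch):

```python
import numpy as np

# Per-sample versions of the surrogate hinge losses in Eqs. (14)-(15),
# which substitute ln(p) + 1 for p (via Lemma 5.1) so that extremely
# negative log-probabilities never pass through exp() and underflow.

def l_cl_true(d_mu, d_pib, log_mu, log_pib, c1):
    return max(0.0, d_pib * (log_pib + 1) - d_mu * (log_mu + 1)
               + c1 * (log_pib + 1))

def l_cl_cql(d_mu, d_pib, log_mu, log_pib, c2, alpha=1.0):
    return max(0.0, (d_mu - alpha) * (log_mu + 1)
               - (d_pib - alpha) * (log_pib + 1)
               + c2 * (log_pib + 1))

# Log-probabilities around -1000 underflow to exactly 0 under exp(),
# while the surrogate losses stay finite:
log_mu, log_pib = -1000.0, -900.0
underflowed = np.exp(log_mu)    # 0.0 in float64: the problem Eqs. 10-11 hit
loss = l_cl_true(1.0, 1.0, log_mu, log_pib, c1=0.5)
```

The hinge form means the losses are zero once the desired inequality holds with the given margin, so gradients only flow when a conservative-level condition is violated.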

5.3. OPTIMIZATION TO MAINTAIN THE MONOTONICITY

Besides the surrogate losses that control the conservative level of ACQL, we also construct a contrastive loss to maintain the monotonicity stated in Definition 4.1, using the Mean Squared Error (MSE) for simplicity. The contrastive loss is defined as follows: for (s_i, a_i), (s_j, a_j) ∉ D and (s_k, a_k), (s_l, a_l) ∈ D,

L_{mono}(d_\mu, d_{\pi_\beta}) = \big\|\big(\sigma(d_\mu(s_i, a_i)) - \sigma(d_\mu(s_j, a_j))\big) - \big(\sigma(m(s_j, a_j)) - \sigma(m(s_i, a_i))\big)\big\|_2^2 + \big\|\big(\sigma(d_{\pi_\beta}(s_k, a_k)) - \sigma(d_{\pi_\beta}(s_l, a_l))\big) - \big(\sigma(m(s_k, a_k)) - \sigma(m(s_l, a_l))\big)\big\|_2^2,  (17)

where σ(·) is a softmax over the current batch of training data, used to unify the orders of magnitude between the adaptive weights and the transition quality measurements.
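The d_µ half of this loss can be sketched on a toy batch as follows (our own simplification: scalar weights instead of network outputs, and a full pairwise average instead of sampled pairs):

```python
import numpy as np

# Sketch of the d_mu part of the monotonicity loss: after a batch softmax,
# differences in the adaptive weight should mirror *reversed* differences
# in the quality measurement m (Definition 4.1: better actions get a lower
# d_mu). Perfectly anti-monotone weights give zero loss.

def softmax(x):
    e = np.exp(np.asarray(x, dtype=float) - np.max(x))
    return e / e.sum()

def l_mono_mu(d_mu_vals, m_vals):
    sd, sm = softmax(d_mu_vals), softmax(m_vals)
    n = len(sd)
    loss = 0.0
    for i in range(n):
        for j in range(n):
            loss += ((sd[i] - sd[j]) - (sm[j] - sm[i])) ** 2
    return loss / n**2

aligned  = l_mono_mu([0.8, 0.2], [0.2, 0.8])  # d_mu falls as m rises
violated = l_mono_mu([0.2, 0.8], [0.2, 0.8])  # d_mu rises with m
```

The softmax normalization is what lets a single MSE compare network outputs (arbitrary scale) against quality scores in (0, 1).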

5.4. FINAL OBJECTIVE

To learn a conservative Q-function, the adaptive weights should be positive, so we add a regularizer term:

L_{pos}(d_\mu, d_{\pi_\beta}) = \max(0, -d_\mu) + \max(0, -d_{\pi_\beta}).  (19)

The final objective for training the adaptive weight functions d_µ, d_{π_β} is:

L(d_\mu, d_{\pi_\beta}) = L_{cl\_true}(d_\mu, d_{\pi_\beta}) + L_{cl\_cql}(d_\mu, d_{\pi_\beta}) + L_{mono}(d_\mu, d_{\pi_\beta}) + L_{pos}(d_\mu, d_{\pi_\beta}).  (20)

ACQL is built on top of CQL (Kumar et al., 2020), which sets µ = π and R(µ) = −D_{KL}(µ, Unif(a)) in Equation (3). The optimization problem then becomes:

\min_Q \max_\pi \min_{d_\mu, d_{\pi_\beta}} \; \mathbb{E}_{s \sim D, a \sim \pi(a|s)}[d_\mu(s,a) \cdot Q(s,a)] - \mathbb{E}_{s \sim D, a \sim \hat{\pi}_\beta(a|s)}[d_{\pi_\beta}(s,a) \cdot Q(s,a)] + \frac{1}{2}\,\mathbb{E}_{s,a,s' \sim D}\big[(Q(s,a) - \hat{\mathcal{B}}^\pi \hat{Q}^k(s,a))^2\big] - D_{KL}(\pi, Unif(a)) + L(d_\mu, d_{\pi_\beta}).  (21)

6. EXPERIMENTS

6.1. MAIN RESULTS

We conducted all experiments on the commonly used offline RL benchmark D4RL (Fu et al., 2020), which includes many task domains (Todorov et al., 2012; Brockman et al., 2016; Rajeswaran et al., 2017) and a variety of dataset types. To provide a comprehensive comparison, we compared ACQL to many state-of-the-art model-free algorithms, including behavioral cloning (BC), SAC-off (Haarnoja et al., 2018), BEAR (Kumar et al., 2019), BRAC (Wu et al., 2019), AWR (Peng et al., 2019), BCQ (Fujimoto et al., 2019), aDICE (Nachum et al., 2019), TD3+BC (Fujimoto & Gu, 2021), Fisher-BRC (Kostrikov et al., 2021a), and CQL (Kumar et al., 2020). For fair comparisons, we directly report the results of all baselines from the D4RL whitepaper (Fu et al., 2020) and their original papers. To be consistent with previous works, we trained ACQL for 1.0M gradient steps and evaluated it for 10 episodes every 1000 training iterations. The results are averaged over 10 episodes and 3 random seeds and are obtained with the workflow proposed by (Kumar et al., 2021).

Gym-MuJoCo Tasks. The Gym-MuJoCo tasks include "halfcheetah", "hopper" and "walker" with 5 dataset types each, ranging from expert to random data.
For brevity, we abbreviate "-expert", "-medium-expert", "-medium-replay", "-medium" and "-random" as "-e", "-m-e", "-m-r", "-m" and "-r", respectively. Table 1 shows the normalized returns on all 15 Gym-MuJoCo version-0 tasks. We observe that ACQL outperforms the other baselines on the Hopper and Walker environments by a large margin. From the perspective of dataset types, ACQL is a very balanced algorithm that achieves excellent results on all kinds of datasets, with expert, medium and random data. We also conducted additional comparisons to more state-of-the-art model-free algorithms, including Decision Transformer (DT) (Chen et al., 2021), AWAC (Nair et al., 2020), Onestep RL (Brandfonbrener et al., 2021), TD3+BC (Fujimoto & Gu, 2021), IQL (Kostrikov et al., 2021b) and CQL (Kumar et al., 2020), on version-2 datasets in Appendix C.1.

Adroit Tasks. The Adroit tasks (Rajeswaran et al., 2017) are high-dimensional robotic manipulation tasks with sparse rewards, including expert data and human demonstrations from narrow distributions. Table 2 shows the normalized returns on all 12 Adroit tasks. ACQL delivers higher performance than the other baselines on the "cloned" datasets, i.e., with mixed expert data and human demonstrations.

Franka Kitchen Tasks. The Franka Kitchen tasks (Gupta et al., 2019) include complex trajectories as the offline datasets and aim to evaluate the "stitching" ability of the agent in a realistic kitchen environment. Table 3 shows the normalized returns on all 3 Kitchen tasks. ACQL consistently exceeds CQL, BC and the other baselines on all 3 kinds of datasets, since it controls the conservative level adaptively.

AntMaze Tasks. The AntMaze tasks mimic real-world robotic navigation, controlling an "Ant" quadruped robot to reach a goal location from a start location with only 0-1 sparse rewards. Due to space limitations, we report the experimental results in Appendix C.3.

6.2. COMPARISONS AMONG DIFFERENT CONSERVATIVE LEVELS

To demonstrate more intuitively that ACQL adaptively controls the conservative level, we compared ACQL to CQL with α ∈ {1, 2, 5, 10, 20}, where different α values represent different conservative levels and a higher α means more conservative. Figure 2 plots the learning curves of ACQL and CQL with the 5 values of α on the Hopper-v0 environments. We also provide a more detailed Figure 4 and a Table 8 of quantitative results in Appendix C.2. From Figure 2, one clear trend is that CQL with a higher conservative level usually achieves higher performance on expert datasets, whereas a lower conservative level is more effective on random datasets. Moreover, it is difficult for CQL to achieve satisfactory performance on all kinds of datasets with a fixed α. This is exactly the issue ACQL addresses: since ACQL generates an adaptive weight for each state-action pair and controls the conservative level in a more fine-grained way, it achieves balanced and state-of-the-art results across dataset types. Table 4 shows comparisons of the average Q-values over the datasets on the HalfCheetah-v0 environments. We observe that as the α of CQL increases, the learned Q-values decrease, since a higher α represents a higher conservative level. The Q-values learned by ACQL are higher than those of CQL, showing that ACQL is less conservative than CQL, while its Q-values do not explode.

6.3. ABLATION STUDY

Effect of the Proposed Losses. Due to space limitations, we report the quantitative results on the Hopper environment, including 5 kinds of datasets, in Table 5. More results for other environments are provided in Appendix C.4. As shown in Table 5, when we only control the conservative level using L_{cl_true} (second row) or L_{cl_cql} (third row), the performance of ACQL drops severely, to only around 21.0 normalized return even on the expert dataset. Without L_{mono} and L_{pos} to limit the range of the adaptive weights, erroneous adaptive weights are easily learned, and the errors snowball as policy evaluation repeats, leading to failure.

Visualization of the Adaptive Weights. To further check the monotonicity of the adaptive weight functions, we visualize the adaptive weights and the corresponding relative transition quality measurements for the HalfCheetah-random-v0 dataset in Figure 3. Note that all values are normalized to unify the magnitudes and are sorted according to the quality measurements. In the left of

7. CONCLUSION

In this paper, we proposed a flexible framework named Adaptive Conservative Q-Learning (ACQL), which sheds light on how to control the conservative level of the Q-function in a fine-grained way. In ACQL, two weight functions, corresponding to out-of-distribution (OOD) actions and actions in the dataset, are introduced to adaptively shape the Q-function. More importantly, the form of these two adaptive weight functions is not fixed, and particular forms can be defined for different scenarios, e.g., elaborately hand-designed rules or learnable deep neural networks. We provide a detailed theoretical analysis of how the conservative level of the learned Q-function changes under different conditions and define the monotonicity of the adaptive weight functions. To illustrate the feasibility of our framework, we propose a novel practical algorithm using neural networks as the weight functions. Guided by the theoretical analysis, we construct surrogate and contrastive losses to control the conservative level and maintain monotonicity. We conducted extensive experiments on commonly used offline RL benchmarks, and the state-of-the-art results demonstrate the effectiveness of our method.

A PROOFS

Proposition A.1 (The conservative level of ACQL). For any µ with supp µ ⊂ supp \hat{π}_β, without considering the sampling error between the empirical backup \hat{\mathcal{B}}^\pi Q and the true Bellman backup \mathcal{B}^\pi Q, the conservative level of ACQL can be controlled at three levels according to different conditions:

1) Control over the Q-values. The learned Q-function \hat{Q}^\pi_{ACQL} is more conservative than the true Q-function Q^\pi point-wise if:

\forall s \in D, a: \quad \frac{d_\mu \cdot \mu - d_{\pi_\beta} \cdot \pi_\beta}{\pi_\beta} \ge 0.  (22)

2) Control over the V-values. The expected value of the learned Q-function \hat{Q}^\pi_{ACQL} is more conservative than the expected value of the true Q-function Q^\pi if:

\forall s \in D: \quad \sum_a \frac{d_\mu \cdot \mu - d_{\pi_\beta} \cdot \pi_\beta}{\pi_\beta} \ge 0.  (23)

3) Control over the empirical MDP.
The learned Q-function \hat{Q}^\pi_{ACQL} is more conservative over the empirical MDP if:

\sum_{s \in D} \sum_a \frac{d_\mu \cdot \mu - d_{\pi_\beta} \cdot \pi_\beta}{\pi_\beta} \ge 0.  (24)

Proof of Proposition A.1. Without considering the sampling error between \hat{\mathcal{B}}^\pi Q and \mathcal{B}^\pi Q, the optimization problem of the Q-function in ACQL is:

\hat{Q}^{k+1}_{ACQL} \leftarrow \arg\min_Q \; \mathbb{E}_{s \sim D, a \sim \mu(a|s)}[d_\mu(s,a) \cdot Q(s,a)] - \mathbb{E}_{s \sim D, a \sim \pi_\beta(a|s)}[d_{\pi_\beta}(s,a) \cdot Q(s,a)] + \frac{1}{2}\,\mathbb{E}_{s,a,s' \sim D}\big[(Q(s,a) - \mathcal{B}^\pi \hat{Q}^k(s,a))^2\big].  (25)

Setting the derivative of Equation (25) to 0, we obtain the form of the resulting Q-function \hat{Q}^{k+1}_{ACQL}: for all s ∈ D and a,

\frac{\partial \hat{Q}^{k+1}_{ACQL}}{\partial Q} = 0  (26)
\;\Rightarrow\; d_\mu \cdot \mu - d_{\pi_\beta} \cdot \pi_\beta + \pi_\beta \cdot (Q - \mathcal{B}^\pi \hat{Q}^k) = 0  (27)
\;\Rightarrow\; \hat{Q}^{k+1}_{ACQL} = \mathcal{B}^\pi \hat{Q}^k - \frac{d_\mu \cdot \mu - d_{\pi_\beta} \cdot \pi_\beta}{\pi_\beta}.  (28)

Note that the true Q-function is derived from the true Bellman operator alone: Q^{k+1} = \mathcal{B}^\pi \hat{Q}^k. Equation (28) thus gives exactly the condition to control the conservative level over the Q-function, as shown in Equation (22). If we want to relax the conservative level, e.g., controlling over the V-values or the empirical MDP, we can relax Equation (28) to a summation over actions for each state or over the whole empirical MDP, as shown in Equations (23) and (24) respectively. For the conditions under which the learned Q-function \hat{Q}^\pi_{ACQL} is less conservative than the true Q-function Q^\pi, we simply replace "≥" with "≤" in Proposition A.1.

Next, we show that ACQL bounds the gap between the learned Q-values and true Q-values when the sampling error between \hat{\mathcal{B}}^\pi Q and \mathcal{B}^\pi Q is taken into account. Following (Auer et al., 2008; Osband et al., 2016; Kumar et al., 2020), the error can be bounded by leveraging the concentration properties of \hat{\mathcal{B}}^\pi. In brief: with high probability ≥ 1 − δ,

|\hat{\mathcal{B}}^\pi Q - \mathcal{B}^\pi Q|(s, a) \le \frac{C_{r,P,\delta}}{\sqrt{|D(s,a)|}}, \quad \forall s, a \in D,

where C_{r,P,\delta} is a constant depending on the reward function r(s, a), the environment dynamics P(·|s, a), and δ ∈ (0, 1).
Proposition A.2 (ACQL bounds the gap between the learned Q-values and true Q-values). Considering the sampling error between the empirical backup $\hat{\mathcal{B}}^\pi Q$ and the true Bellman backup $\mathcal{B}^\pi Q$, with high probability $\ge 1 - \delta$, the gap between the learned Q-function $\hat{Q}^\pi_{\mathrm{ACQL}}$ and the true Q-function $Q^\pi$ satisfies:
$$\forall s \in \mathcal{D},\, a, \quad g(s,a) - \mathrm{err}(s,a) \le \hat{Q}^\pi_{\mathrm{ACQL}}(s,a) - Q^\pi(s,a) \le g(s,a) + \mathrm{err}(s,a), \tag{29}$$
where
$$g(s,a) = -\left[(I - \gamma P^\pi)^{-1}\, \frac{d_\mu \cdot \mu - d_{\pi_\beta} \cdot \pi_\beta}{\pi_\beta}\right](s,a), \tag{30}$$
$$\mathrm{err}(s,a) = \left[(I - \gamma P^\pi)^{-1}\, \frac{C_{r,P,\delta}\, R_{\max}}{(1-\gamma)\sqrt{|\mathcal{D}(s,a)|}}\right](s,a) \ge 0. \tag{31}$$

1) If $d_\mu(s,a) = d_{\pi_\beta}(s,a) = \alpha$ for all $s \in \mathcal{D}$ and $a$, this reduces to the case of CQL, where $\hat{V}^\pi$ lower-bounds $V^\pi$ for a sufficiently large $\alpha$, rather than a point-wise lower bound on the Q-function. 2) If $g(s,a) \ge \mathrm{err}(s,a)$ for some $s \in \mathcal{D}$ and $a$, then by the left inequality the learned Q-function $\hat{Q}^\pi_{\mathrm{ACQL}}$ is more optimistic than the true Q-function $Q^\pi$ in these regions. 3) If $g(s,a) \le -\mathrm{err}(s,a)$ for some $s \in \mathcal{D}$ and $a$, then by the right inequality the learned Q-function $\hat{Q}^\pi_{\mathrm{ACQL}}$ is more conservative than the true Q-function $Q^\pi$ in these regions. Note that $\mathrm{err}(s,a)$ is positive for every state-action pair. Given the bounds in Proposition A.2, instead of only knowing the direction of the relationship (i.e., more or less conservative), we can control the fine-grained range of the gap more precisely by carefully designing $d_\mu(s,a)$ and $d_{\pi_\beta}(s,a)$.

Proof of Proposition A.2. In Proposition A.1, we calculated the gap between the learned Q-function $\hat{Q}^\pi_{\mathrm{ACQL}}$ and the true Q-function $Q^\pi$, representing the conservative level, without the sampling error between the empirical backup $\hat{\mathcal{B}}^\pi Q$ and the true Bellman backup $\mathcal{B}^\pi Q$. We now obtain a more precise bound on the conservative level with this sampling error taken into account.
Following (Auer et al., 2008; Osband et al., 2016; Kumar et al., 2020), we can relate the empirical Bellman backup $\hat{\mathcal{B}}^\pi Q$ and the true Bellman backup $\mathcal{B}^\pi Q$ as follows: with high probability $\ge 1 - \delta$, $\delta \in (0,1)$,
$$|\hat{\mathcal{B}}^\pi Q - \mathcal{B}^\pi Q|(s,a) \le \frac{C_{r,P,\delta}\, R_{\max}}{(1-\gamma)\sqrt{|\mathcal{D}(s,a)|}}, \quad \forall Q,\; s,a \in \mathcal{D}, \tag{32}$$
where $R_{\max}$ is an upper bound on the reward (i.e., $|r(s,a)| \le R_{\max}$) and $C_{r,P,\delta}$ is a constant depending on the reward function $r(s,a)$ and the environment dynamics $P(\cdot|s,a)$. More detailed proofs of this relationship are provided in (Kumar et al., 2020). For brevity, write $\Delta := \frac{d_\mu \cdot \mu - d_{\pi_\beta} \cdot \pi_\beta}{\pi_\beta}$ and $\epsilon := \frac{C_{r,P,\delta}\, R_{\max}}{(1-\gamma)\sqrt{|\mathcal{D}(s,a)|}}$. For the right inequality in Equation 29, we reason about the fixed point of the Q-function in ACQL as follows:
$$|\hat{\mathcal{B}}^\pi \hat{Q}_{\mathrm{ACQL}} - \mathcal{B}^\pi \hat{Q}_{\mathrm{ACQL}}|(s,a) \le \epsilon \tag{33}$$
$$\Rightarrow\; \hat{\mathcal{B}}^\pi \hat{Q}_{\mathrm{ACQL}} \le \mathcal{B}^\pi \hat{Q}_{\mathrm{ACQL}} + \epsilon \tag{34}$$
$$\Rightarrow\; \hat{\mathcal{B}}^\pi \hat{Q}_{\mathrm{ACQL}} - \Delta \le \mathcal{B}^\pi \hat{Q}_{\mathrm{ACQL}} - \Delta + \epsilon \tag{35}$$
$$\Rightarrow\; \hat{Q}^\pi_{\mathrm{ACQL}} \le \mathcal{B}^\pi \hat{Q}_{\mathrm{ACQL}} - \Delta + \epsilon \tag{36}$$
$$\Rightarrow\; \hat{Q}^\pi_{\mathrm{ACQL}} \le \left(r + \gamma P^\pi \hat{Q}^\pi_{\mathrm{ACQL}}\right) - \Delta + \epsilon \tag{37}$$
$$\Rightarrow\; \hat{Q}^\pi_{\mathrm{ACQL}} \le (I - \gamma P^\pi)^{-1}\left[r - \Delta + \epsilon\right] \tag{38}$$
$$\Rightarrow\; \hat{Q}^\pi_{\mathrm{ACQL}} \le Q^\pi - (I - \gamma P^\pi)^{-1}\Delta + (I - \gamma P^\pi)^{-1}\epsilon \tag{39}$$
$$\Rightarrow\; \hat{Q}^\pi_{\mathrm{ACQL}} - Q^\pi \le g(s,a) + \mathrm{err}(s,a). \tag{40}$$
Here Equation 36 uses the fixed-point relation $\hat{Q}^\pi_{\mathrm{ACQL}} = \hat{\mathcal{B}}^\pi \hat{Q}^\pi_{\mathrm{ACQL}} - \Delta$ from Equation 28, and Equation 39 uses $Q^\pi = (I - \gamma P^\pi)^{-1} r$.
For the left inequality in Equation 29, the process is similar. Again write $\Delta := \frac{d_\mu \cdot \mu - d_{\pi_\beta} \cdot \pi_\beta}{\pi_\beta}$ and $\epsilon := \frac{C_{r,P,\delta}\, R_{\max}}{(1-\gamma)\sqrt{|\mathcal{D}(s,a)|}}$:
$$|\hat{\mathcal{B}}^\pi \hat{Q}_{\mathrm{ACQL}} - \mathcal{B}^\pi \hat{Q}_{\mathrm{ACQL}}|(s,a) \le \epsilon \tag{41}$$
$$\Rightarrow\; \hat{\mathcal{B}}^\pi \hat{Q}_{\mathrm{ACQL}} \ge \mathcal{B}^\pi \hat{Q}_{\mathrm{ACQL}} - \epsilon \tag{42}$$
$$\Rightarrow\; \hat{\mathcal{B}}^\pi \hat{Q}_{\mathrm{ACQL}} - \Delta \ge \mathcal{B}^\pi \hat{Q}_{\mathrm{ACQL}} - \Delta - \epsilon \tag{43}$$
$$\Rightarrow\; \hat{Q}^\pi_{\mathrm{ACQL}} \ge \mathcal{B}^\pi \hat{Q}_{\mathrm{ACQL}} - \Delta - \epsilon \tag{44}$$
$$\Rightarrow\; \hat{Q}^\pi_{\mathrm{ACQL}} \ge \left(r + \gamma P^\pi \hat{Q}^\pi_{\mathrm{ACQL}}\right) - \Delta - \epsilon \tag{45}$$
$$\Rightarrow\; \hat{Q}^\pi_{\mathrm{ACQL}} \ge (I - \gamma P^\pi)^{-1}\left[r - \Delta - \epsilon\right] \tag{46}$$
$$\Rightarrow\; \hat{Q}^\pi_{\mathrm{ACQL}} \ge Q^\pi - (I - \gamma P^\pi)^{-1}\Delta - (I - \gamma P^\pi)^{-1}\epsilon \tag{47}$$
$$\Rightarrow\; \hat{Q}^\pi_{\mathrm{ACQL}} - Q^\pi \ge g(s,a) - \mathrm{err}(s,a). \tag{48}$$

Proposition A.3 (The conservative level compared to CQL). For any $\mu$ with $\mathrm{supp}\,\mu \subset \mathrm{supp}\,\pi_\beta$, given that the Q-function learned by CQL is $\hat{Q}^\pi_{\mathrm{CQL}}(s,a) = Q^\pi - \alpha\,\frac{\mu - \pi_\beta}{\pi_\beta}$, and similarly to Proposition A.1, the conservative level of ACQL relative to CQL can be controlled at three levels according to different conditions:

1) Control over the Q-values. The learned Q-function $\hat{Q}^\pi_{\mathrm{ACQL}}$ is less conservative than the CQL Q-function $\hat{Q}^\pi_{\mathrm{CQL}}$ point-wise, if:
$$\forall s \in \mathcal{D},\, a, \quad \frac{(\alpha - d_\mu)\mu - (\alpha - d_{\pi_\beta})\pi_\beta}{\pi_\beta} \ge 0. \tag{49}$$

2) Control over the V-values. The expected value of the learned Q-function $\hat{Q}^\pi_{\mathrm{ACQL}}$ is less conservative than the expected value of the CQL Q-function $\hat{Q}^\pi_{\mathrm{CQL}}$, if:
$$\forall s \in \mathcal{D}, \quad \sum_a \frac{(\alpha - d_\mu)\mu - (\alpha - d_{\pi_\beta})\pi_\beta}{\pi_\beta} \ge 0. \tag{50}$$

3) Control over the empirical MDP. The learned Q-function $\hat{Q}^\pi_{\mathrm{ACQL}}$ is less conservative over the empirical MDP, if:
$$\sum_{s \in \mathcal{D}} \sum_a \frac{(\alpha - d_\mu)\mu - (\alpha - d_{\pi_\beta})\pi_\beta}{\pi_\beta} \ge 0. \tag{51}$$

Proof of Proposition A.3. As shown in CQL (Kumar et al., 2020), we first recap the optimization problem of CQL:
$$\hat{Q}^{k+1}_{\mathrm{CQL}} \leftarrow \arg\min_Q \; \alpha\left(\mathbb{E}_{s \sim \mathcal{D},\, a \sim \mu(a|s)}[Q(s,a)] - \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi_\beta(a|s)}[Q(s,a)]\right) + \frac{1}{2}\, \mathbb{E}_{s,a,s' \sim \mathcal{D}}\!\left[\left(Q(s,a) - \mathcal{B}^\pi \hat{Q}^k(s,a)\right)^2\right]. \tag{52}$$
We can observe that CQL is a special case of ACQL in which the adaptive weight functions $d_\mu(s,a)$ and $d_{\pi_\beta}(s,a)$ are both the constant $\alpha$. Setting the derivative of Equation 52 to 0, we obtain the form of the resulting Q-function $\hat{Q}^{k+1}_{\mathrm{CQL}}$: for all $s \in \mathcal{D}$ and $a$,
$$\frac{\partial \hat{Q}^{k+1}_{\mathrm{CQL}}}{\partial Q} = 0 \tag{53}$$
$$\Rightarrow\; \alpha \cdot \mu - \alpha \cdot \pi_\beta + \pi_\beta \cdot \left(Q - \mathcal{B}^\pi \hat{Q}^k\right) = 0 \tag{54}$$
$$\Rightarrow\; \hat{Q}^{k+1}_{\mathrm{CQL}} = \mathcal{B}^\pi \hat{Q}^k - \alpha\,\frac{\mu - \pi_\beta}{\pi_\beta}. \tag{55}$$
Similar to the proof of Proposition A.1, we compute the difference between the Q-values of ACQL and CQL from Equations (28) and (55): for all $s \in \mathcal{D}$ and $a$,
$$\hat{Q}^{k+1}_{\mathrm{ACQL}} - \hat{Q}^{k+1}_{\mathrm{CQL}} \tag{56}$$
$$= \mathcal{B}^\pi \hat{Q}^k - \frac{d_\mu \cdot \mu - d_{\pi_\beta} \cdot \pi_\beta}{\pi_\beta} - \mathcal{B}^\pi \hat{Q}^k + \alpha\,\frac{\mu - \pi_\beta}{\pi_\beta} \tag{57}$$
$$= \frac{(\alpha - d_\mu)\mu - (\alpha - d_{\pi_\beta})\pi_\beta}{\pi_\beta}. \tag{58}$$
If we want ACQL to learn a less conservative Q-function than CQL, we need the difference in Equation 58 to be greater than 0, as shown in Equation 49. If we want to relax the conservative level, e.g., to control only the V-values or the whole empirical MDP, we can relax Equation 58 to a summation over the actions at each state or over the whole empirical MDP, as shown in Equations (50) and (51) respectively. For the conditions under which the learned Q-function $\hat{Q}^\pi_{\mathrm{ACQL}}$ is more conservative than the CQL Q-function $\hat{Q}^\pi_{\mathrm{CQL}}$, we can simply replace "≥" with "≤" in Proposition A.3.

Lemma A.1. For $x > 0$, $\ln x \le x - 1$, with equality if and only if $x = 1$.

Proof of Lemma A.1. Let $f(x) = \ln x - x + 1$; then $f'(x) = \frac{1-x}{x}$. Thus $f$ is increasing on $(0, 1)$ and decreasing on $(1, +\infty)$, so $f(x) \le f(1) = 0$.
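As a numerical illustration of Proposition A.3, the following sketch (hypothetical toy values for $\mu$, $\pi_\beta$, and the adaptive weights) evaluates the difference in Equation 58 and its sign, which indicates per action where ACQL is more or less conservative than CQL:

```python
import numpy as np

# Illustrative single-state setup (all values hypothetical).
mu    = np.array([0.1, 0.2, 0.3, 0.4])   # mu(a|s)
pi_b  = np.array([0.4, 0.3, 0.2, 0.1])   # pi_beta(a|s)
alpha = 1.0                              # CQL's fixed trade-off weight
d_mu  = np.array([1.5, 0.8, 1.2, 0.5])   # adaptive weight d_mu(s, a)
d_pib = np.array([0.7, 1.3, 0.9, 1.1])   # adaptive weight d_pi_beta(s, a)

# Penalties subtracted from the true backup (Equations 28 and 55).
gap_acql = (d_mu * mu - d_pib * pi_b) / pi_b
gap_cql  = alpha * (mu - pi_b) / pi_b

# Equation 58: per-action difference between the two fixed points.
diff = gap_cql - gap_acql
expected = ((alpha - d_mu) * mu - (alpha - d_pib) * pi_b) / pi_b
assert np.allclose(diff, expected)

# Where diff > 0, ACQL is less conservative than CQL at that action;
# where diff < 0, it is more conservative.
print(np.sign(diff))
```

Because the weights vary per action, the sign of the difference can change across actions in the same state, which is exactly the fine-grained control that a single constant $\alpha$ cannot provide.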

C ADDITIONAL EXPERIMENTS

C.1 COMPARISONS TO OFFLINE RL BASELINES ON GYM-MUJOCO-V2 DATASETS

Aiming to provide a more comprehensive comparison, we compared ACQL to many state-of-the-art model-free algorithms, including behavioral cloning (BC), 10%BC, Decision Transformer (DT) (Chen et al., 2021), AWAC (Nair et al., 2020), Onestep RL (Brandfonbrener et al., 2021), TD3+BC (Fujimoto & Gu, 2021), IQL (Kostrikov et al., 2021a) and CQL (Kumar et al., 2020) on the version-2 datasets. For the sake of fair comparison, we directly report the results of all baselines from the D4RL whitepaper (Fu et al., 2020) and their original papers. Table 7 shows the normalized returns of all 15 Gym-MuJoCo tasks on the version-2 datasets. Since most baselines did not report results on the "-expert" datasets, we conducted experiments on the other four types of datasets. We observe that ACQL consistently outperforms the other baselines on all three environments, especially on the Hopper environment, achieving a summed normalized return of 302.9 over the Hopper datasets compared to 268.2 from CQL, the second-best method. From the perspective of dataset types, ACQL is a well-balanced algorithm that achieves excellent results on all kinds of datasets, whether composed of expert, medium, or random data. Furthermore, we find that ACQL achieves higher performance than the other baselines when datasets contain more medium and random data; we argue this is because ACQL can adaptively compute weights for data of different qualities and thus learn a better policy. For instance, while the best result on hopper-random-v2 among the other baselines is 9.6 from AWAC, ACQL reaches 31.4, more than three times better. Figure 4 plots the learning curves of ACQL and CQL with 5 values of α. One trend we can clearly observe in Figure 4 is that CQL with a higher conservative level usually achieves higher performance on expert datasets, whereas a lower conservative level is more effective on random datasets.
For instance, the CQL agent with α = 20 (green line) delivered rapidly increasing returns over the training epochs on the halfcheetah-expert dataset (top-left corner), while hovering at the bottom of the plot on the halfcheetah-random dataset (top-right corner). Moreover, it is difficult for CQL to achieve satisfactory performance on all kinds of datasets with a fixed α. This is exactly the issue that ACQL focuses on. Since ACQL generates an adaptive weight for each state-action pair and controls the conservative level in a more fine-grained way, it achieves balanced, state-of-the-art results across dataset types. In addition, in the plot of the hopper-medium-replay task (2nd row, 2nd column), ACQL outperformed the CQL agents by a large margin, showing the significance of an adaptive weight for each data sample instead of a single shared weight α. Table 8 shows the corresponding quantitative results of ACQL and CQL with the same 5 values of α, and the same trend holds: for instance, the CQL agent with α = 20 reached 96.3 on the halfcheetah-expert dataset, but dropped dramatically to 13.7 on the halfcheetah-random dataset, while ACQL again outperformed the CQL agents by a large margin on the hopper-medium-replay task.

C.3 COMPARISONS TO OFFLINE RL BASELINES ON ANTMAZE TASKS

The AntMaze tasks mimic real-world robotic navigation, aiming to control an "Ant" quadruped robot to reach a goal location from a start location with only 0-1 sparse rewards. Aiming to provide a more comprehensive comparison, we compared ACQL to many state-of-the-art model-free algorithms, including behavioral cloning (BC), 10%BC, Decision Transformer (DT) (Chen et al., 2021), AWAC (Nair et al., 2020), Onestep RL (Brandfonbrener et al., 2021), TD3+BC (Fujimoto & Gu, 2021), IQL (Kostrikov et al., 2021a) and CQL (Kumar et al., 2020). For the sake of fair comparison, we directly report the results of all baselines from the D4RL whitepaper (Fu et al., 2020) and their original papers. Table 9 shows the normalized returns on all 6 AntMaze tasks. We observe that ACQL delivers state-of-the-art performance on the umaze-diverse-v0 and umaze-v0 datasets, demonstrating its effectiveness on small mazes. However, while IQL achieves better performance on the larger "-medium" and "-large" mazes, ACQL falls behind in these scenarios. These results suggest that it is harder for ACQL to tell whether state-action pairs are good or bad when only 0-1 sparse rewards are given.

C.4 ADDITIONAL ABLATION STUDY

Besides the main results comparing to other baselines and the detailed comparison across conservative levels, we also conducted extensive ablation studies for ACQL to study the effect of the proposed losses. We report the quantitative results on all Gym-MuJoCo environments, covering 15 datasets, in Tables 10, 11 and 12.



Figure 1: Performance gaps of CQL with different conservative levels on HalfCheetah tasks.

During the training process, we add the weight neural networks d_µ and d_π_β and train the weight networks, Q-networks, and policy networks alternately. Due to space limitations, the implementation details are provided in Appendix B.
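The alternating scheme above can be sketched as follows; the three scalar "networks" and the quadratic placeholder losses are purely illustrative stand-ins for the weight, Q, and policy networks, not the paper's actual objectives:

```python
# Toy illustration of the alternating update loop: three scalar "networks"
# stand in for the weight, Q, and policy networks. The losses below are
# placeholders chosen only to make the alternation visible and convergent.

class Scalar:
    """A one-parameter 'network' updated by a gradient step."""
    def __init__(self, x0):
        self.x = x0
    def step(self, grad, lr=0.1):
        self.x -= lr * grad

weight, q, policy = Scalar(0.0), Scalar(0.0), Scalar(0.0)

for _ in range(200):
    # 1) update the weight network (placeholder loss: (w - q)^2)
    weight.step(2 * (weight.x - q.x))
    # 2) update the Q-network (placeholder loss: (q - 1)^2)
    q.step(2 * (q.x - 1.0))
    # 3) update the policy network (placeholder loss: (p - q)^2)
    policy.step(2 * (policy.x - q.x))

print(round(weight.x, 4), round(q.x, 4), round(policy.x, 4))
```

The point of the sketch is the update order: within each step the weight network is refreshed first, the Q-network then trains against the (newly) shaped penalty, and the policy finally improves against the shaped Q-function.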

Figure 2: Learning curves comparing to CQL with different conservative levels on Hopper-v0 environments.

In the left of Figure 3, d_µ_rand denotes d_µ(s, a) where s is in the dataset and a is sampled randomly, while d_µ_policy denotes d_µ(s, a) where a is predicted by the training policy. We can see that as the quality measurements (green and red points) increase, the adaptive weights d_µ (blue and yellow points) decrease, indicating that the Q-values of good OOD state-action pairs should be suppressed less. In the right of Figure 3, d_π_β_dataset denotes d_π_β(s, a) where (s, a) is sampled from the dataset. As the quality measurements (green points) increase, the adaptive weights d_π_β (blue points) also increase, indicating that the Q-values of good in-dataset state-action pairs should be higher.

Figure 3: Visualization of the adaptive weights and relative transition quality m(s, a) on HalfCheetah-random-v0 dataset.
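The monotonic trends visualized in Figure 3 could, for instance, be encouraged by a pairwise ranking penalty. The hinge-style form below is only an illustrative sketch, not the contrastive loss actually used by ACQL; the function name and margin parameter are hypothetical:

```python
import numpy as np

def monotonicity_penalty(quality, weight, increasing=True, margin=0.0):
    """Penalize pairs (i, j) whose weight ordering disagrees with quality.

    quality: per-sample quality measure m(s, a)
    weight:  per-sample adaptive weight (d_mu or d_pi_beta)
    increasing: True if the weight should grow with quality
                (d_pi_beta), False if it should shrink (d_mu).
    """
    dq = quality[:, None] - quality[None, :]   # m_i - m_j for all pairs
    dw = weight[:, None] - weight[None, :]     # d_i - d_j for all pairs
    sign = 1.0 if increasing else -1.0
    # hinge violation on pairs where quality says i > j
    viol = np.maximum(0.0, margin - sign * dw) * (dq > 0)
    return viol.mean()

quality = np.array([0.1, 0.5, 0.9])
good    = np.array([0.2, 0.5, 0.8])   # increases with quality -> no penalty
bad     = np.array([0.8, 0.5, 0.2])   # decreases with quality -> penalized

print(monotonicity_penalty(quality, good, increasing=True))   # 0.0
print(monotonicity_penalty(quality, bad,  increasing=True))   # > 0
```

With `increasing=False`, the same penalty would instead push d_µ downward as quality rises, matching the left panel of Figure 3.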


Figure 4: Learning curves comparing to CQL with different conservative levels.

Normalized results on D4RL Gym-MuJoCo environments.

Normalized results on D4RL Adroit environments.

Normalized results on D4RL Franka Kitchen environments.

Comparisons of average Q-values over the datasets on HalfCheetah-v0 environments.

Ablation Study on Gym-MuJoCo Hopper-v0 environments in terms of normalized results.

Normalized results on D4RL Gym-MuJoCo environments.

To demonstrate more intuitively that ACQL can adaptively control the conservative level, we compared ACQL to CQL with α values in {1, 2, 5, 10, 20}. Different α values represent different conservative levels, and a higher α means a more conservative Q-function.

Normalized results on D4RL Gym-MuJoCo environments.

Normalized results on D4RL AntMaze environments.

Ablation Study on D4RL Gym-MuJoCo HalfCheetah-v0 environments in terms of normalized results. (Table columns: the losses L_cl_true, L_cl_cql, L_mono, L_pos; the datasets halfcheetah-e, halfcheetah-m-e, halfcheetah-m-r, halfcheetah-m, halfcheetah-r.)

Ablation Study on D4RL Gym-MuJoCo Hopper-v0 environments in terms of normalized results. (Table columns: the losses L_cl_true, L_cl_cql, L_mono, L_pos; the datasets hopper-e, hopper-m-e, hopper-m-r, hopper-m, hopper-r.)

Ablation Study on D4RL Gym-MuJoCo Walker environments in terms of normalized results. (Table columns: the losses L_cl_true, L_cl_cql, L_mono, L_pos; the datasets walker-e, walker-m-e, walker-m-r, walker-m, walker-r.)

B EXPERIMENTAL DETAILS

Software. We ran our experiments with the following packages and software:
• d4rl 1.1
• Python 3.8.13
• PyTorch 1.10.0+cu111

Implementation Details of ACQL. In ACQL, we represent the two adaptive weight functions d_µ(s, a) and d_π_β(s, a) with a single neural network, which has the same architecture as the Q-function but an output dimension of 2. During each gradient-descent step, we train the weight network, the Q-networks, and the policy networks in turn. We list all hyperparameters and network architectures of ACQL in Table 6.
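A minimal sketch of such a shared weight network is shown below (NumPy, with illustrative layer sizes and a softplus output to keep the weights positive; the actual architecture and hyperparameters are those in Table 6, and the dimensions here are hypothetical):

```python
import numpy as np

# Schematic of the shared weight network: one MLP taking the concatenated
# (s, a) and emitting two heads, d_mu(s, a) and d_pi_beta(s, a).

rng = np.random.default_rng(0)

def init(sizes):
    """Small random weights and zero biases for each layer."""
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)        # ReLU hidden layers
    return np.log1p(np.exp(x))            # softplus keeps weights positive

state_dim, action_dim, hidden = 17, 6, 256
params = init([state_dim + action_dim, hidden, hidden, 2])

sa = rng.normal(size=(32, state_dim + action_dim))   # a batch of (s, a)
out = forward(params, sa)
d_mu, d_pi_beta = out[:, 0], out[:, 1]    # two heads from one network
print(out.shape)                          # (32, 2)
```

Sharing one backbone for both heads keeps the two weight functions coupled through a common representation of (s, a), which is one plausible reason for using a single network with output dimension 2 rather than two separate networks.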

