OFFLINE REINFORCEMENT LEARNING WITH CLOSED-FORM POLICY IMPROVEMENT OPERATORS

Anonymous

Abstract

Behavior constrained policy optimization has been demonstrated to be a successful paradigm for tackling Offline Reinforcement Learning. By exploiting historical transitions, a policy is trained to maximize a learned value function while constrained by the behavior policy to avoid a significant distributional shift. In this paper, we propose our closed-form policy improvement operators. We make a novel observation that the behavior constraint naturally motivates the use of first-order Taylor approximation, leading to a linear approximation of the policy objective. Additionally, as practical datasets are usually collected by heterogeneous policies, we model the behavior policies as a Gaussian Mixture and overcome the induced optimization difficulties by leveraging the LogSumExp's lower bound and Jensen's Inequality, giving rise to a closed-form policy improvement operator. We instantiate offline RL algorithms with our novel policy improvement operators and empirically demonstrate their effectiveness over state-of-the-art algorithms on the standard D4RL benchmark.

1. INTRODUCTION

Deploying Reinforcement Learning (RL) (Sutton & Barto, 2018) in the real world is hindered by its massive demand for online data. In domains such as robotics (Cabi et al., 2019) and autonomous driving (Sallab et al., 2017), rolling out a premature policy is prohibitively costly and unsafe. To address this issue, offline RL (a.k.a. batch RL) (Levine et al., 2020; Lange et al., 2012) has been proposed to learn a policy directly from historical data without environment interaction. However, learning competent policies from a static dataset is challenging. Prior studies have shown that learning a policy without constraining its deviation from the data-generating policies suffers from significant extrapolation errors, leading to training divergence (Fujimoto et al., 2019; Kumar et al., 2019). Current literature has demonstrated two successful paradigms for managing the trade-off between policy improvement and limiting the distributional shift from the behavior policies. Under the actor-critic framework (Konda & Tsitsiklis, 1999), behavior constrained policy optimization (BCPO) (Fujimoto et al., 2019; Kumar et al., 2019; Fujimoto & Gu, 2021; Wu et al., 2019; Brandfonbrener et al., 2021; Ghasemipour et al., 2021) explicitly regularizes the divergence between the learned and behavior policies, while conservative methods (Kumar et al., 2020b; Bai et al., 2022; Yu et al., 2020; 2021) penalize the value estimates of out-of-distribution (OOD) actions to avoid overestimation error. However, most existing model-free offline RL algorithms still require learning off-policy value functions and a target policy through stochastic gradient descent (SGD).
Unlike supervised learning, off-policy learning with non-linear function approximators and temporal difference learning (Sutton & Barto, 2018) is notoriously unstable (Kumar et al., 2020a; Mnih et al., 2015; Henderson et al., 2018; Konda & Tsitsiklis, 1999; Watkins & Dayan, 1992) due to the existence of the deadly triad (Sutton & Barto, 2018; Van Hasselt et al., 2018). The performance can exhibit significant variance even across different random seeds (Islam et al., 2017). In offline settings, learning becomes even more problematic, as environment interaction is restricted, preventing the learner from receiving corrective feedback (Kumar et al., 2020a). Consequently, training stability poses a major challenge in offline RL. Although some current approaches (Brandfonbrener et al., 2021) circumvent the requirement of learning an off-policy value function, they still require learning a policy via SGD. Can we mitigate the issue of learning instability by leveraging optimization techniques? In this paper, we approach this issue from the policy learning perspective, aiming to design a stable policy improvement operator. We take a closer look at the BCPO paradigm and make a novel observation that the requirement of limited distributional shift motivates the use of the first-order Taylor approximation (Callahan, 2010), leading to a linear approximation of the policy objective that is accurate in a sufficiently small neighborhood of the behavior action. Based on this crucial insight, we construct policy improvement operators that return closed-form solutions by carefully designing a tractable behavior constraint. When modeling the behavior policy as a Single Gaussian, our policy improvement operator deterministically shifts the behavior policy in a value-improving direction derived by solving a Quadratically Constrained Linear Program (QCLP) in closed form.
Therefore, our method only requires learning the underlying behavior policies of a given dataset with supervised learning, avoiding the training instability of SGD-based policy improvement. Furthermore, we note that practical datasets are likely to be collected by heterogeneous policies, which may give rise to a multimodal behavior action distribution. In this scenario, a Single Gaussian will fail to capture the entire picture of the underlying distribution, limiting the potential for policy improvement. While modeling the behavior as a Gaussian Mixture provides better expressiveness, it incurs extra optimization difficulties due to the non-concavity of its log-likelihood function. We tackle this issue by leveraging the LogSumExp's lower bound and Jensen's Inequality, leading to a closed-form policy improvement (CFPI) operator compatible with a multimodal behavior policy. Empirically, we demonstrate the effectiveness of the Gaussian Mixture over the conventional Single Gaussian when the underlying distribution comes from heterogeneous policies. In this paper, we empirically demonstrate that our CFPI operators can instantiate successful offline RL algorithms in a one-step or iterative fashion. Moreover, our methods can also be leveraged to improve a policy learned by other algorithms. In summary, our main contributions are threefold:

• CFPI operators compatible with single-mode and multimodal behavior policies.
• An empirical demonstration of the benefit of modeling the behavior policy as a Gaussian Mixture in model-free offline RL. To the best of our knowledge, we are the first to do this.
• One-step and iterative instantiations of our algorithm, which outperform state-of-the-art (SOTA) algorithms on the standard D4RL benchmark (Fu et al., 2020).

2. PRELIMINARIES

Reinforcement Learning. RL aims to maximize returns in a Markov Decision Process (MDP) (Sutton & Barto, 2018) $\mathcal{M} = (\mathcal{S}, \mathcal{A}, R, T, \rho_0, \gamma)$, with state space $\mathcal{S}$, action space $\mathcal{A}$, reward function $R$, transition function $T$, initial state distribution $\rho_0$, and discount factor $\gamma \in [0, 1)$. At each time step $t$, the agent starts from a state $s_t \in \mathcal{S}$, selects an action $a_t \sim \pi(\cdot|s_t)$ from its policy $\pi$, transitions to a new state $s_{t+1} \sim T(\cdot|s_t, a_t)$, and receives reward $r_t := R(s_t, a_t)$. The goal of an RL agent is to learn an optimal policy $\pi^*$ that maximizes the expected discounted cumulative reward $\mathbb{E}_\pi[\sum_{t=0}^{\infty} \gamma^t r_t]$ without access to the ground-truth $R$ and $T$. We define the action value function associated with $\pi$ by $Q^\pi(s, a) = \mathbb{E}_\pi[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, a_0 = a]$. The RL objective can then be reformulated as

$$\pi^* = \arg\max_\pi J(\pi) := \mathbb{E}_{s \sim \rho_0, a \sim \pi(\cdot|s)}[Q^\pi(s, a)].$$

In this paper, we consider the offline RL setting, where we assume restricted access to the MDP $\mathcal{M}$ and a previously collected dataset $\mathcal{D}$ with $N$ transition tuples $\{(s_t^i, a_t^i, r_t^i)\}_{i=1}^N$. We denote the underlying policy that generated $\mathcal{D}$ as $\pi_\beta$, which may or may not be a mixture of individual policies.

Behavior Constrained Policy Optimization. One of the critical challenges in offline RL is that the learned Q function tends to assign spuriously high values to OOD actions due to extrapolation error, which is well documented in previous literature (Fujimoto et al., 2019; Kumar et al., 2019).
Behavior Constrained Policy Optimization (BCPO) methods (Fujimoto et al., 2019; Kumar et al., 2019; Fujimoto & Gu, 2021; Wu et al., 2019; Brandfonbrener et al., 2021) explicitly constrain the action selection of the learned policy to stay close to the behavior policy $\pi_\beta$, resulting in a policy improvement step that can be generally summarized by the optimization problem below:

$$\max_\pi \; \mathbb{E}_{s \sim \mathcal{D}}\left[ \mathbb{E}_{\tilde{a} \sim \pi(\cdot|s)}[Q(s, \tilde{a})] - \alpha D(\pi(\cdot|s), \pi_\beta(\cdot|s)) \right], \tag{2}$$

where $D(\cdot, \cdot)$ is a divergence function that calculates the divergence between two action distributions, and $\alpha$ is a hyper-parameter controlling the strength of regularization. Consequently, the policy is optimized to maximize the Q-value while staying close to the behavior distribution. Different algorithms may choose different $D(\cdot, \cdot)$ (e.g., KL Divergence (Wu et al., 2019; Jaques et al., 2019), MSE (Fujimoto & Gu, 2021), and MMD (Kumar et al., 2019)). However, to the best of our knowledge, all existing methods tackle this optimization via SGD. In this paper, we take advantage of the regularization and solve the problem in closed form.
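To make the contrast with the closed-form approach concrete, the SGD-based policy improvement of (2) can be illustrated on a toy instance; the linear Q and the quadratic MSE regularizer below are illustrative assumptions, not the paper's learned networks:

```python
import numpy as np

# Toy sketch of the BCPO objective (Eq. 2): gradient ascent on
# Q(s, a) - alpha * ||a - mu_beta||^2, i.e. an MSE behavior regularizer.

def bcpo_step(a, q_grad, mu_beta, alpha, lr=0.1):
    """One SGD step on the regularized objective."""
    grad = q_grad(a) - 2.0 * alpha * (a - mu_beta)
    return a + lr * grad

# Example: Q(s, a) = g^T a (linear in a), behavior mean at the origin.
g = np.array([1.0, -2.0])
q_grad = lambda a: g
a = np.zeros(2)
for _ in range(200):
    a = bcpo_step(a, q_grad, mu_beta=np.zeros(2), alpha=1.0)

# Fixed point: g - 2*alpha*(a - mu_beta) = 0  =>  a = g / (2*alpha)
print(a)  # ≈ [0.5, -1.0]
```

Even on this convex toy problem, reaching the optimum requires many gradient steps; the closed-form operators developed in Sec. 3 instead solve (an approximation of) this problem exactly in one shot.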

3. CLOSED-FORM POLICY IMPROVEMENT

In this section, we introduce our policy improvement operators, which map the behavior policy to a higher-valued policy by solving a linearly approximated BCPO. We show that modeling the behavior policy as a Single Gaussian transforms the approximated BCPO into a QCLP, which can thus be solved in closed form (Sec. 3.1). Given that practical datasets are usually collected by heterogeneous policies, we generalize this result by modeling the behavior policies as a Gaussian Mixture to improve expressiveness, and we overcome the incurred optimization difficulties by leveraging the LogSumExp's lower bound (LB) and Jensen's Inequality (Sec. 3.2). We close this section by presenting an offline RL paradigm that leverages our policy improvement operators (Sec. 3.3).

3.1. APPROXIMATED BEHAVIOR CONSTRAINED OPTIMIZATION

We aim to design a learning-free policy improvement operator to avoid learning instability in offline settings. We observe that optimizing the BCPO policy objective (2) induces a policy that admits only limited deviation from the behavior policy. Consequently, it will only query the Q-value within a neighborhood of the behavior action during training, which naturally motivates the use of the first-order Taylor approximation to derive the following linear approximation of the Q function:

$$\bar{Q}(s, a; a_\beta) = (a - a_\beta)^T [\nabla_a Q(s, a)]_{a=a_\beta} + Q(s, a_\beta) = a^T [\nabla_a Q(s, a)]_{a=a_\beta} + \text{const}. \tag{3}$$

By Taylor's theorem (Callahan, 2010), $\bar{Q}(s, a; a_\beta)$ only provides an accurate linear approximation of $Q(s, a)$ in a sufficiently small neighborhood of $a_\beta$. Therefore, the choice of $a_\beta$ is critical. Recognizing (2) as a Lagrangian and using the linear approximation (3), we propose to solve the following surrogate of (2) given any state $s$:

$$\max_\pi \; \mathbb{E}_{\tilde{a} \sim \pi}\left[ \tilde{a}^T [\nabla_a Q(s, a)]_{a=a_\beta} \right], \quad \text{s.t.} \; D(\pi(\cdot|s), \pi_\beta(\cdot|s)) \le \delta. \tag{4}$$

Note that it is not necessary for $D(\cdot, \cdot)$ to be a (mathematically defined) divergence measure, since any generic $D(\cdot, \cdot)$ that can constrain the deviation of $\pi$'s action from $\pi_\beta$ can be considered.

Single Gaussian Behavior Policy. In general, (4) does not always have a closed-form solution. We analyze a special case where $\pi_\beta = \mathcal{N}(\mu_\beta, \Sigma_\beta)$ is a Gaussian policy, $\pi = \mu$ is a deterministic policy, and $D(\cdot, \cdot)$ is the negative log-likelihood. In this scenario, a reasonable choice of $\mu$ should concentrate around $\mu_\beta$ to limit the distributional shift. Therefore, we set $a_\beta = \mu_\beta$, and the optimization problem (4) becomes:

$$\max_\mu \; \mu^T [\nabla_a Q(s, a)]_{a=\mu_\beta}, \quad \text{s.t.} \; -\log \pi_\beta(\mu|s) \le \delta. \tag{5}$$

We now show that (5) has a closed-form solution. Proposition 3.1.
The optimization problem (5) has a closed-form solution given by

$$\mu_{sg}(\tau) = \mu_\beta + \sqrt{\frac{2 \log \tau}{[\nabla_a Q(s, a)]_{a=\mu_\beta}^T \Sigma_\beta [\nabla_a Q(s, a)]_{a=\mu_\beta}}} \, \Sigma_\beta [\nabla_a Q(s, a)]_{a=\mu_\beta}, \quad \text{where} \; \delta = \frac{1}{2} \log \det(2\pi\Sigma_\beta) + \log \tau. \tag{6}$$

Proof sketch. (5) can be converted into the QCLP below, which has the closed-form solution (6) (Boyd et al., 2004). A full proof is given in Appendix A.1.

$$\max_\mu \; \mu^T [\nabla_a Q(s, a)]_{a=\mu_\beta}, \quad \text{s.t.} \; \frac{1}{2} (\mu - \mu_\beta)^T \Sigma_\beta^{-1} (\mu - \mu_\beta) \le \delta - \frac{1}{2} \log \det(2\pi\Sigma_\beta). \tag{7}$$

Although we still have to tune $\tau$, as one tunes $\alpha$ in (2) for conventional BCPO methods, the tractability of (5) gives a transparent interpretation of $\tau$'s effect on the action selection. By the KKT conditions (Boyd et al., 2004), (6) always returns an action $\mu_{sg}$ with the following property:

$$-\log \pi_\beta(\mu_{sg}|s) = \delta = -\log \frac{1}{\tau} \pi_\beta(\mu_\beta|s) \iff \pi_\beta(\mu_{sg}|s) = \frac{1}{\tau} \pi_\beta(\mu_\beta|s). \tag{8}$$

While setting $\tau = 1$ always returns the mean of $\pi_\beta$, a large $\tau$ might send $\mu_{sg}$ out of the support of $\pi_\beta$, breaking the accuracy guarantee of the first-order Taylor approximation.
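The update (6) is a few lines of linear algebra. A minimal numerical sketch, assuming a toy linear Q and hand-picked Gaussian parameters (all illustrative):

```python
import numpy as np

# Closed-form single-Gaussian CFPI update (Eq. 6): shift the behavior mean
# along the Sigma_beta-preconditioned Q-gradient.

def mu_sg(mu_beta, Sigma_beta, q_grad, tau):
    """Return the closed-form maximizer of Eq. (5)."""
    g = q_grad(mu_beta)                 # [grad_a Q(s, a)] at a = mu_beta
    direction = Sigma_beta @ g
    scale = np.sqrt(2.0 * np.log(tau) / (g @ direction))
    return mu_beta + scale * direction

mu_beta = np.array([0.0, 1.0])
Sigma_beta = np.diag([0.25, 1.0])
q_grad = lambda a: np.array([2.0, -1.0])   # linear Q => constant gradient
tau = 4.0

mu = mu_sg(mu_beta, Sigma_beta, q_grad, tau)

# KKT check (Eq. 8): the solution lies exactly on the constraint boundary,
# i.e. (mu - mu_beta)^T Sigma_beta^-1 (mu - mu_beta) = 2 log tau.
d = mu - mu_beta
print(d @ np.linalg.solve(Sigma_beta, d))  # ≈ 2 * log(4) ≈ 2.7726
```

Note how the covariance acts as a trust region: directions in which the behavior policy has low variance are shifted less, regardless of the Q-gradient magnitude.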

3.2. GAUSSIAN MIXTURE AS A MORE EXPRESSIVE MODEL

Performing policy improvement with (6) enjoys favorable computational efficiency and avoids the potential instability caused by SGD. However, its tractability relies on the Single Gaussian assumption on the behavior policy $\pi_\beta$. In practice, historical datasets are usually collected by heterogeneous policies with different levels of expertise. A Single Gaussian may fail to capture the whole picture of the underlying distribution, motivating the use of a Gaussian Mixture to represent $\pi_\beta$:

$$\pi_\beta = \sum_{i=1}^N \lambda_i \mathcal{N}(\mu_i, \Sigma_i), \quad \sum_{i=1}^N \lambda_i = 1. \tag{9}$$

However, directly plugging the Gaussian Mixture $\pi_\beta$ into (5) breaks its tractability, resulting in the non-convex optimization problem below:

$$\max_\mu \; \mu^T [\nabla_a Q(s, a)]_{a=a_\beta}, \quad \text{s.t.} \; \log \sum_{i=1}^N \lambda_i \det(2\pi\Sigma_i)^{-\frac{1}{2}} \exp\left( -\frac{1}{2} (\mu - \mu_i)^T \Sigma_i^{-1} (\mu - \mu_i) \right) \ge -\delta. \tag{10}$$

We are confronted with two major challenges in solving (10). First, it is unclear how to choose a proper $a_\beta$, while we need to ensure that the solution $\mu$ lies within a small neighborhood of $a_\beta$. Second, the constraint of (10) does not admit a convex form, posing non-trivial optimization difficulties. We leverage the lemma below to tackle the non-convexity of the constraint.

Lemma 3.1. $\log \sum_{i=1}^N \lambda_i \exp(x_i)$ admits the following inequalities:
1. (LogSumExp's LB) $\log \sum_{i=1}^N \lambda_i \exp(x_i) \ge \max_i \{x_i + \log \lambda_i\}$
2. (Jensen's Inequality) $\log \sum_{i=1}^N \lambda_i \exp(x_i) \ge \sum_{i=1}^N \lambda_i x_i$

Next, we show that applying each inequality in Lemma 3.1 to the constraint of (10) resolves the intractability and leads to natural choices of $a_\beta$.

Proposition 3.2. By applying the first inequality of Lemma 3.1 to the constraint of (10), we can derive an optimization problem that lower bounds (10):

$$\max_\mu \; \mu^T [\nabla_a Q(s, a)]_{a=a_\beta}, \quad \text{s.t.} \; \max_i \left\{ -\frac{1}{2} (\mu - \mu_i)^T \Sigma_i^{-1} (\mu - \mu_i) - \frac{1}{2} \log \det(2\pi\Sigma_i) + \log \lambda_i \right\} \ge -\delta, \tag{11}$$

and the closed-form solution to (11) is given by

$$\mu_{lse}(\tau) = \arg\max_{\hat{\mu}_i(\delta)} \; \hat{\mu}_i(\delta)^T [\nabla_a Q(s, a)]_{a=\mu_i}, \quad \text{with} \; \delta = \min_i \left\{ \tfrac{1}{2} \log \det(2\pi\Sigma_i) - \log \lambda_i \right\} + \log \tau, \tag{12}$$

where

$$\hat{\mu}_i(\delta) = \mu_i + \sqrt{\frac{2(\delta + \log \lambda_i) - \log \det(2\pi\Sigma_i)}{[\nabla_a Q(s, a)]_{a=\mu_i}^T \Sigma_i [\nabla_a Q(s, a)]_{a=\mu_i}}} \, \Sigma_i [\nabla_a Q(s, a)]_{a=\mu_i}.$$

Proposition 3.3. By applying the second inequality of Lemma 3.1 to the constraint of (10), we can derive an optimization problem that lower bounds (10):

$$\max_\mu \; \mu^T [\nabla_a Q(s, a)]_{a=a_\beta}, \quad \text{s.t.} \; \sum_{i=1}^N \lambda_i \left( -\frac{1}{2} \log \det(2\pi\Sigma_i) - \frac{1}{2} (\mu - \mu_i)^T \Sigma_i^{-1} (\mu - \mu_i) \right) \ge -\delta, \tag{13}$$

and the closed-form solution to (13) is given by

$$\mu_{jensen}(\tau) = \bar{\mu} + \sqrt{\frac{2 \log \tau - \sum_{i=1}^N \lambda_i \mu_i^T \Sigma_i^{-1} \mu_i + \bar{\mu}^T \bar{\Sigma}^{-1} \bar{\mu}}{[\nabla_a Q(s, a)]_{a=\bar{\mu}}^T \bar{\Sigma} [\nabla_a Q(s, a)]_{a=\bar{\mu}}}} \, \bar{\Sigma} [\nabla_a Q(s, a)]_{a=\bar{\mu}}, \tag{14}$$

where $\bar{\Sigma} = \left( \sum_{i=1}^N \lambda_i \Sigma_i^{-1} \right)^{-1}$, $\bar{\mu} = \bar{\Sigma} \sum_{i=1}^N \lambda_i \Sigma_i^{-1} \mu_i$, and $\delta = \log \tau + \frac{1}{2} \sum_{i=1}^N \lambda_i \log \det(2\pi\Sigma_i)$.

We defer the detailed proofs of Propositions 3.2 and 3.3, as well as how we choose $a_\beta$ for each optimization problem, to Appendix A.2 and A.3, respectively. Indeed, the two optimization problems have their own assets and liabilities. When $\pi_\beta$ exhibits obvious multimodality, as shown in Fig. 1 (L), the lower bound of $\log \pi_\beta$ constructed by Jensen's Inequality cannot capture different modes due to its concavity, losing the advantage of modeling $\pi_\beta$ as a Gaussian Mixture. In this case, the optimization problem (11) can serve as a reasonable surrogate of (10), as LogSumExp's LB still preserves the multimodality of $\log \pi_\beta$. When $\pi_\beta$ reduces to a Single Gaussian, the approximation with Jensen's Inequality becomes an equality, as shown in Fig. 1 (M); thus $\mu_{jensen}$ returned by (14) exactly solves the optimization problem (10). However, in this case, the tightness of LogSumExp's LB largely depends on the weights $\lambda_{i=1\ldots N}$: if all Gaussian components are distributed and weighted identically, the lower bound will be $\log N$ below the actual value. Moreover, there also exist scenarios (Fig. 1 (R)) in which both (11) and (13) can serve as reasonable surrogates of the original problem (10).
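The two inequalities in Lemma 3.1 are easy to sanity-check numerically; a minimal sketch on a random instance (the instance itself is an illustrative assumption):

```python
import numpy as np

# Numerical check of Lemma 3.1: both lower bounds on log sum_i lambda_i exp(x_i).
rng = np.random.default_rng(0)
N = 5
x = rng.normal(size=N)
lam = rng.dirichlet(np.ones(N))          # mixture weights, sum to 1

lse = np.log(np.sum(lam * np.exp(x)))    # log sum_i lambda_i exp(x_i)
lb_max = np.max(x + np.log(lam))         # LogSumExp's LB
lb_jensen = np.sum(lam * x)              # Jensen's Inequality

print(lb_max <= lse, lb_jensen <= lse)   # True True
```

Either bound replaces the non-convex log-density constraint of (10) with a tractable one: the first keeps the per-component quadratics separate (preserving multimodality), while the second averages them into a single quadratic.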
Fortunately, we can combine the best of both worlds and design a policy improvement operator accounting for all the above scenarios. As both Propositions 3.2 and 3.3 have closed-form solutions, the operator returns a policy that selects the higher-valued action between $\mu_{lse}$ and $\mu_{jensen}$:

$$\mu_{mg}(\tau) = \arg\max_\mu \; Q(s, \mu), \quad \text{s.t.} \; \mu \in \{\mu_{lse}(\tau), \mu_{jensen}(\tau)\}. \tag{15}$$
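A sketch of how (12), (14), and (15) fit together, under toy assumptions (a two-component mixture and a linear Q standing in for the learned critic):

```python
import numpy as np

# Mixture CFPI sketch (Eqs. 12, 14, 15). All numbers below are illustrative.

def shifted_mean(mu, Sigma, g, budget_sq):
    """Move mu along the Sigma-preconditioned gradient g to the constraint boundary."""
    d = Sigma @ g
    return mu + np.sqrt(budget_sq / (g @ d)) * d

def mu_lse(mus, Sigmas, lams, q_grad, tau):
    # Eq. (12): delta = min_i {1/2 log det(2 pi Sigma_i) - log lambda_i} + log tau
    logdets = [np.linalg.slogdet(2 * np.pi * S)[1] for S in Sigmas]
    delta = min(0.5 * ld - np.log(l) for ld, l in zip(logdets, lams)) + np.log(tau)
    best, best_val = None, -np.inf
    for mu, S, l, ld in zip(mus, Sigmas, lams, logdets):
        g = q_grad(mu)
        r2 = 2 * (delta + np.log(l)) - ld      # squared Mahalanobis budget
        if r2 < 0:                              # component infeasible, skip
            continue
        cand = shifted_mean(mu, S, g, r2)
        val = cand @ g                          # linearized objective at mu_i
        if val > best_val:
            best, best_val = cand, val
    return best

def mu_jensen(mus, Sigmas, lams, q_grad, tau):
    # Eq. (14): shift from the pseudo-mean mu_bar under Sigma_bar
    # (assumes tau is large enough that the budget below is non-negative).
    Sbar = np.linalg.inv(sum(l * np.linalg.inv(S) for l, S in zip(lams, Sigmas)))
    mubar = Sbar @ sum(l * np.linalg.solve(S, m) for l, S, m in zip(lams, Sigmas, mus))
    r2 = (2 * np.log(tau)
          - sum(l * m @ np.linalg.solve(S, m) for l, S, m in zip(lams, Sigmas, mus))
          + mubar @ np.linalg.solve(Sbar, mubar))
    return shifted_mean(mubar, Sbar, q_grad(mubar), r2)

def mu_mg(mus, Sigmas, lams, q, q_grad, tau):
    # Eq. (15): return the higher-valued of the two closed-form candidates.
    cands = [mu_lse(mus, Sigmas, lams, q_grad, tau),
             mu_jensen(mus, Sigmas, lams, q_grad, tau)]
    return max(cands, key=q)

# Toy bimodal behavior: a "medium" mode at [-1, 0] and a better mode at [1, 0].
tau = 10.0
mus = [np.array([-1.0, 0.0]), np.array([1.0, 0.0])]
Sigmas = [0.25 * np.eye(2), 0.25 * np.eye(2)]
lams = [0.5, 0.5]
q = lambda a: a[0]                       # toy value: first action coordinate
q_grad = lambda a: np.array([1.0, 0.0])

a_star = mu_mg(mus, Sigmas, lams, q, q_grad, tau)
print(a_star)  # ≈ [2.073, 0.0]: the mu_lse candidate shifted from the better mode
```

On this bimodal instance the LogSumExp candidate wins, as predicted by the discussion above: the Jensen candidate starts from the pseudo-mean between the modes and is allowed only a small shift.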

3.3. ALGORITHM TEMPLATE

We have derived two CFPI operators that map the behavior policy to a higher-valued policy. When the behavior policy $\pi_\beta$ is a Single Gaussian, $\mathcal{I}_{SG}(\pi_\beta, Q; \tau)$ returns a policy with actions selected by (6). When $\pi_\beta$ is a Gaussian Mixture, $\mathcal{I}_{MG}(\pi_\beta, Q; \tau)$ returns a policy with actions selected by (15). We note that our methods can also work with a non-Gaussian $\pi_\beta$: Appendix D provides derivations of the corresponding CFPI operators when $\pi_\beta$ is modeled as a deterministic policy and as a VAE. Algorithm 1 shows that our CFPI operators enable the design of a general offline RL template that can yield one-step, multi-step, and iterative methods, where $\mathcal{E}$ is a general policy evaluation operator that returns a value function $\hat{Q}_t$. Setting $T = 0$ yields our one-step method. We defer the discussion of multi-step and iterative methods to Appendix C. While the design of our CFPI operators is motivated by the behavior constraint, we highlight that they are compatible with general baseline policies $\pi_b$ besides $\pi_\beta$. Sec. 5.2 and Appendix G.7 show that our CFPI operators can improve policies learned by IQL and CQL (Kumar et al., 2020b).
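The template of Algorithm 1 can be sketched as follows; the single-Gaussian operator acting on the policy mean, the toy evaluation oracle that returns an exact linear Q-gradient, and all constants are illustrative assumptions:

```python
import numpy as np

# Sketch of the Algorithm 1 template: alternate a policy evaluation operator E
# with a CFPI improvement step. In the paper, E is e.g. SARSA or Fitted
# Q-Iteration over the dataset; here it is a stub returning a gradient oracle.

def cfpi_sg(mu, Sigma, q_grad, tau):
    """Single-Gaussian CFPI operator I_SG (Eq. 6), acting on the policy mean."""
    g = q_grad(mu)
    d = Sigma @ g
    return mu + np.sqrt(2 * np.log(tau) / (g @ d)) * d

def run_template(mu_beta, Sigma, evaluate, tau, T=0):
    """T = 0 gives the one-step method; T > 0 the multi-step/iterative variants."""
    mu = mu_beta
    for _ in range(T + 1):
        q_grad = evaluate(mu)                 # E step: (re-)estimate the critic
        mu = cfpi_sg(mu, Sigma, q_grad, tau)  # I step: closed-form improvement
    return mu

# Toy check: with Q(s, a) = g^T a, every CFPI step moves the mean further
# along g, so more iterations give a higher (linearized) value.
g = np.array([1.0, 0.0])
evaluate = lambda mu: (lambda a: g)
one_step = run_template(np.zeros(2), np.eye(2), evaluate, tau=2.0, T=0)
iterative = run_template(np.zeros(2), np.eye(2), evaluate, tau=2.0, T=2)
print(one_step @ g < iterative @ g)  # True
```

The point of the sketch is the control flow: the only learned objects are the critic and the baseline policy, and the improvement step itself involves no gradient descent.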

3.4. THEORETICAL GUARANTEES FOR CLOSED-FORM POLICY IMPROVEMENT

At a high level, Algorithm 1 follows approximate policy iteration (API) (Perkins & Precup, 2002), iterating between policy evaluation ($\mathcal{E}$ step, Line 4) and policy improvement ($\mathcal{I}$ step, Line 5: Get policy $\pi_{t+1} = \mathcal{I}(\pi_b, \hat{Q}_t; \tau)$, where concrete choices of $\mathcal{I}$ include $\mathcal{I}_{MG}$ and $\mathcal{I}_{SG}$). Therefore, to verify that the $\mathcal{I}$ step indeed provides improvement, we first need to show that the policy evaluation $\hat{Q}_t$ is accurate. We employ Fitted Q-Iteration (Sutton & Barto, 2018) to perform policy evaluation, which is known to be statistically efficient (e.g., Chen & Jiang, 2019) under mild conditions on the function approximation class. Next, for the performance gap $J(\pi_{t+1}) - J(\pi_t)$, we apply the standard performance difference lemma (Kakade & Langford, 2002; Kakade, 2003).

Theorem 3.1 (Safe Policy Improvement). Assume the state and action spaces are discrete. Let $\hat{\pi}_1$ be the policy obtained after the CFPI update (Line 2 of Algorithm 1). Then with probability $1 - \delta$,

$$J(\hat{\pi}_1) - J(\pi_\beta) \ge \frac{1}{1-\gamma} \mathbb{E}_{s \sim d^{\hat{\pi}_1}}\left[ \hat{Q}^{\pi_\beta}(s, \hat{\pi}_1(s)) - \hat{Q}^{\pi_\beta}(s, \hat{\pi}_\beta(s)) \right] - \frac{2}{1-\gamma} \mathbb{E}_{s \sim d^{\hat{\pi}_1}} \mathbb{E}_{a \sim \hat{\pi}_1(\cdot|s)}\left[ C_{\gamma,\delta} D(s, a) + C_{CFPI}(s, a) \right] := \zeta.$$

For the $T$-step iterative update, we similarly have, with probability $1 - \delta$,

$$J(\hat{\pi}_T) - J(\pi_\beta) = \sum_{t=1}^T \left( J(\hat{\pi}_t) - J(\hat{\pi}_{t-1}) \right) \ge \sum_{t=1}^T \zeta^{(t)}.$$

4. RELATED WORK

Our methods belong to and are motivated by the successful BCPO paradigm, which imposes constraints as in (2) to prevent selecting OOD actions. Algorithms in this paradigm may apply different divergence functions, e.g., KL-divergence (Wu et al., 2019; Jaques et al., 2019), MMD (Kumar et al., 2019), or MSE (Fujimoto & Gu, 2021). All these methods perform policy improvement via SGD. Instead, we perform CFPI by solving a linear approximation of (2). Another line of research enforces the behavior constraint via parameterization. BCQ (Fujimoto et al., 2019) learns a generative model as the behavior policy and a Q function to select an action from a set of perturbed behavior actions. Ghasemipour et al. (2021) further show that the perturbation model can be discarded. The design of our CFPI operators is inspired by the SOTA online RL algorithm OAC (Ciosek et al., 2019), which treats the evaluation policy as the baseline $\pi_b$ and obtains an optimistic exploration policy by solving an optimization problem similar to (7). We extend this result to accommodate a multimodal $\pi_b$ and overcome the induced optimization difficulties by leveraging Lemma 3.1. In Appendix H, we further draw connections with prior works that leveraged the Taylor expansion approach in RL. Recently, one-step algorithms (Kostrikov et al., 2021; Brandfonbrener et al., 2021) have achieved great success. Instead of iteratively performing policy improvement and evaluation, these methods only learn a Q function via SARSA, without bootstrapping from OOD action values, and then apply a policy improvement operator (Wu et al., 2019; Peng et al., 2019) to extract a policy. We also instantiate a one-step algorithm with our CFPI operator and evaluate it on standard benchmarks.

Table 1: Comparison between our one-step policy and SOTA methods on the Gym-MuJoCo domain of D4RL. Our method uses the same $\tau$ for all datasets except Hopper-M-E (detailed in Appendix F.1). We report the mean and standard deviation of our method's performance across 10 seeds. Each seed contains an individual training process and evaluates the policy for 100 episodes. We use Cheetah for HalfCheetah, M for Medium, E for Expert, and R for Replay. We bold the best results for each task.

5. EXPERIMENTS

Our experiments aim to demonstrate the effectiveness of our CFPI operators. First, on the standard offline RL benchmark D4RL (Fu et al., 2020), we show that instantiating offline RL algorithms with our CFPI operators, in both one-step and iterative manners, outperforms SOTA methods (Sec. 5.1). Second, we show that our CFPI operators can improve a policy learned by other algorithms (Sec. 5.2). Ablation studies in Sec. 5.3 further show our superiority over the other policy improvement operators and demonstrate the benefit of modeling the behavior policy as a Gaussian Mixture.

5.1. COMPARISON WITH SOTA OFFLINE RL ALGORITHMS

We instantiate a one-step offline RL algorithm from Algorithm 1 with our policy improvement operator $\mathcal{I}_{MG}$. We learn a Gaussian Mixture baseline policy $\hat{\pi}_\beta$ via behavior cloning and employ the IQN (Dabney et al., 2018a) architecture to model the Q value network for its better generalizability, as we need to estimate out-of-buffer $Q(s, a)$ during policy deployment. We train $\hat{Q}_0$ with the SARSA algorithm (Sutton & Barto, 2018; Parisotto et al., 2015). Appendix F.1 includes detailed training procedures for $\hat{\pi}_\beta$ and $\hat{Q}_0$ with full hyperparameter settings. We obtain our one-step policy as $\mathcal{I}_{MG}(\hat{\pi}_\beta, \hat{Q}_0; \tau)$. We evaluate the effectiveness of our one-step algorithm on the D4RL benchmark, focusing on the Gym-MuJoCo domain, which contains locomotion tasks with dense rewards. Table 1 compares our one-step algorithm with SOTA methods, including the other one-step actor-critic methods IQL (Kostrikov et al., 2021) and OneStepRL (Brandfonbrener et al., 2021), the BCPO method TD3+BC (Fujimoto & Gu, 2021), the conservative method CQL (Kumar et al., 2020b), and the trajectory optimization methods DT (Chen et al., 2021) and TT (Janner et al., 2021). We also include the performance of two behavior-cloned policies, SG-BC and MG-BC, modeled with a Single Gaussian and a Gaussian Mixture, respectively. We directly report results for IQL, BCQ, TD3+BC, CQL, and DT from the IQL paper, and TT's results from its own paper. Note that OneStepRL instantiates three different algorithms; we only report its (Rev. KL Reg) result because this variant follows the BCPO paradigm and achieves the best overall performance. We highlight that OneStepRL reports results obtained by tuning the hyperparameters for each dataset. Results in Table 1 demonstrate that our one-step algorithm outperforms the other algorithms by a significant margin without training a policy to maximize its Q-value through SGD. We note that we use the same $\tau$ for all datasets except Hopper-M-E. In Sec.
5.3, we will perform ablation studies and provide a fair comparison between our CFPI operators and the other policy improvement operators. We further instantiate an iterative algorithm with $\mathcal{I}_{MG}$ and evaluate its effectiveness on the challenging AntMaze domain of D4RL. The 6 AntMaze tasks are more challenging due to their sparse-reward nature and the lack of optimal trajectories in the static datasets.

5.2. IMPROVING POLICIES LEARNED BY OTHER ALGORITHMS

In this section, we show that our CFPI operator $\mathcal{I}_{SG}$ can further improve the performance of a Single Gaussian policy $\pi_{IQL}$ learned by IQL (Kostrikov et al., 2021) on the AntMaze domain. We first obtain the IQL policy $\pi_{IQL}$ and $Q_{IQL}$ by training for 1M gradient steps using the PyTorch implementation from RLkit (Berkeley). We emphasize that we follow the authors' exact training and evaluation protocols and include all training curves in Appendix G.6. Interestingly, while the running average of the evaluation results during the course of training matches the results reported in the IQL paper, Table 3 shows that the evaluation of the final 1M-step policy $\pi_{IQL}$ does not match the reported performance on all 6 tasks, echoing the training instability we are trying to resolve with our CFPI operators. This demonstrates how drastically performance can fluctuate across just dozens of epochs. Thanks to the tractability of $\mathcal{I}_{SG}$, we directly obtain an improved policy $\mathcal{I}_{SG}(\pi_{IQL}, Q_{IQL}; \tau)$ that achieves better overall performance than both $\pi_{IQL}$ (train) and $\pi_{IQL}$ (1M), as shown in Table 3. We tune the hyperparameter $\tau$ using a small set of seeds for each task, following the practice of Brandfonbrener et al. (2021) and Fu et al. (2020), and include more details in Appendix F.2 and G.6.

5.3. ABLATION STUDIES

We first provide a fair comparison with the other policy improvement operators, demonstrating the effectiveness of solving the approximated BCPO (4) and of modeling the behavior policy as a Gaussian Mixture. Additionally, we examine the sensitivity to $\tau$, ablate the number of Gaussian components, and discuss limitations by ablating the Q network in Appendix G.2, G.3, and G.4, respectively.

Effectiveness of our CFPI operators. For all methods, we present results with $\hat{\pi}_\beta$ modeled by a Single Gaussian (SG-) and by a Gaussian Mixture (MG-). To ensure a fair comparison, we employ the same $\hat{Q}_0$ and $\hat{\pi}_\beta$, modeled and learned in the same way as in Sec. 5.1, for all methods. Moreover, we tune $N_{bcq}$ for EBCQ, $\alpha$ for Rev. KL Reg, and $\tau$ for our methods. Each method uses the same set of hyperparameters for all datasets; as a result, the Hopper-M-E performance of $\mathcal{I}_{MG}$ reported here differs from that in Table 1.

Figure 2 (caption): The CIs are estimated using the percentile bootstrap with stratified sampling. Higher median, IQM, and mean scores, and lower Optimality Gap, correspond to better performance. Our $\mathcal{I}_{MG}$ outperforms baselines by a significant margin on all four metrics. Appendix E includes additional details.

As shown in Table 4 and Fig. 2, our $\mathcal{I}_{MG}$ clearly outperforms all baselines by a significant margin. The learning-based method Rev. KL Reg exhibits a substantial amount of variance, again echoing the training instability we are trying to resolve. Moreover, our CFPI operators outperform their EBCQ counterparts, demonstrating the effectiveness of solving the approximated BCPO.

Effectiveness of Gaussian Mixture. As the three M-E datasets are collected by an expert and a medium policy, we should recover expert performance as long as we can 1) capture the two modes of the action distribution and 2) always select actions from the expert mode.
In other words, we can leverage the $\hat{Q}_0$ learned by SARSA to select actions from the means of the Gaussian components, resulting in a mode selection algorithm (MG-MS) that selects its action by

$$\mu_{mode} = \arg\max_{\hat{\mu}_i : \hat{\lambda}_i > \xi} \hat{Q}_0(s, \hat{\mu}_i), \quad \text{where} \; \sum_{i=1}^N \hat{\lambda}_i \mathcal{N}(\hat{\mu}_i, \hat{\Sigma}_i) = \hat{\pi}_\beta$$

and $\xi$ is set to filter out trivial components. Our MG-MS achieves expert performance on Hopper-M-E (104.2 ± 5.1) and Walker2d-M-E (104.1 ± 6.7), and matches SOTA algorithms on Cheetah-M-E (91.3 ± 2.1). Appendix G.1 includes the full results of MG-MS on the Gym-MuJoCo domain.
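The MG-MS rule amounts to a few lines; the mixture, value function, and threshold below are illustrative assumptions:

```python
import numpy as np

# Minimal sketch of the MG-MS mode-selection rule: evaluate Q at each
# non-trivial component mean and pick the best.

def mg_ms(means, weights, q, xi=0.1):
    """Pick the component mean with the highest Q among components with weight > xi."""
    cands = [m for m, w in zip(means, weights) if w > xi]
    return max(cands, key=q)

means = [np.array([-1.0, 0.0]),   # "medium" mode
         np.array([1.0, 0.0]),    # "expert" mode
         np.array([9.0, 9.0])]    # trivial component, filtered out by xi
weights = [0.55, 0.42, 0.03]
q = lambda a: a[0] - a[1]         # toy value function favoring the expert mode

print(mg_ms(means, weights, q))   # [1. 0.]
```

Unlike the CFPI operators, MG-MS never leaves the component means; it isolates the benefit of simply capturing and selecting the right mode of a heterogeneous dataset.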

6. CONCLUSION AND LIMITATIONS

Motivated by the behavior constraint in the BCPO paradigm, we propose CFPI operators that perform policy improvement by solving an approximated BCPO in closed form. As practical datasets are usually generated by heterogeneous policies, we use a Gaussian Mixture to model the data-generating policies and overcome the extra optimization difficulties by leveraging the LogSumExp's LB and Jensen's Inequality. We instantiate a one-step offline RL algorithm with our CFPI operator and show that it can outperform SOTA algorithms on the Gym-MuJoCo domain of the D4RL benchmark. Our CFPI operators avoid the training instability incurred by policy improvement through SGD. However, our method still requires learning a good Q function. Specifically, our operators rely on the gradient information provided by Q, whose accuracy largely determines the effectiveness of our policy improvement. Therefore, a promising direction for future work is to investigate ways to robustify the policy improvement given a noisy Q.

REPRODUCIBILITY STATEMENT

We include our codes in the supplementary material.

A PROOFS AND THEORETICAL RESULTS

A.1 PROOF OF PROPOSITION 3.1

Proposition 3.1. The optimization problem (5) has a closed-form solution given by

$$\mu_{sg}(\tau) = \mu_\beta + \sqrt{\frac{2 \log \tau}{[\nabla_a Q(s, a)]_{a=\mu_\beta}^T \Sigma_\beta [\nabla_a Q(s, a)]_{a=\mu_\beta}}} \, \Sigma_\beta [\nabla_a Q(s, a)]_{a=\mu_\beta}, \quad \text{where} \; \delta = \frac{1}{2} \log \det(2\pi\Sigma_\beta) + \log \tau.$$

Proof. The optimization problem (5) can be converted into the QCLP

$$\max_\mu \; \mu^T [\nabla_a Q(s, a)]_{a=\mu_\beta}, \quad \text{s.t.} \; \frac{1}{2} (\mu - \mu_\beta)^T \Sigma_\beta^{-1} (\mu - \mu_\beta) \le \delta - \frac{1}{2} \log \det(2\pi\Sigma_\beta). \tag{18}$$

Following a similar procedure as in OAC (Ciosek et al., 2019), we first derive the Lagrangian:

$$\mathcal{L} = \mu^T [\nabla_a Q(s, a)]_{a=\mu_\beta} - \eta \left( \frac{1}{2} (\mu - \mu_\beta)^T \Sigma_\beta^{-1} (\mu - \mu_\beta) - \delta + \frac{1}{2} \log \det(2\pi\Sigma_\beta) \right). \tag{19}$$

Taking the derivative w.r.t. $\mu$, we get

$$\nabla_\mu \mathcal{L} = [\nabla_a Q(s, a)]_{a=\mu_\beta} - \eta \Sigma_\beta^{-1} (\mu - \mu_\beta). \tag{20}$$

Setting $\nabla_\mu \mathcal{L} = 0$ gives

$$\mu = \mu_\beta + \frac{1}{\eta} \Sigma_\beta [\nabla_a Q(s, a)]_{a=\mu_\beta}. \tag{21}$$

To satisfy the KKT conditions (Boyd et al., 2004), we require $\eta > 0$ and

$$(\mu - \mu_\beta)^T \Sigma_\beta^{-1} (\mu - \mu_\beta) = 2\delta - \log \det(2\pi\Sigma_\beta). \tag{22}$$

Finally, combining (21) and (22), we get

$$\eta = \sqrt{\frac{[\nabla_a Q(s, a)]_{a=\mu_\beta}^T \Sigma_\beta [\nabla_a Q(s, a)]_{a=\mu_\beta}}{2\delta - \log \det(2\pi\Sigma_\beta)}}. \tag{23}$$

Setting $\delta = \frac{1}{2} \log \det(2\pi\Sigma_\beta) + \log \tau$ and plugging (23) into (21), we obtain the final solution

$$\mu_{sg}(\tau) = \mu_\beta + \sqrt{\frac{2 \log \tau}{[\nabla_a Q(s, a)]_{a=\mu_\beta}^T \Sigma_\beta [\nabla_a Q(s, a)]_{a=\mu_\beta}}} \, \Sigma_\beta [\nabla_a Q(s, a)]_{a=\mu_\beta},$$

which completes the proof.

A.2 PROOF OF PROPOSITION 3.2

Proposition 3.2. By applying the first inequality of Lemma 3.1 to the constraint of (10), we can derive an optimization problem that lower bounds (10):

$$\max_\mu \; \mu^T [\nabla_a Q(s, a)]_{a=a_\beta}, \quad \text{s.t.} \; \max_i \left\{ -\frac{1}{2} (\mu - \mu_i)^T \Sigma_i^{-1} (\mu - \mu_i) - \frac{1}{2} \log \det(2\pi\Sigma_i) + \log \lambda_i \right\} \ge -\delta, \tag{25}$$

and the closed-form solution to (11) is given by

$$\mu_{lse}(\tau) = \arg\max_{\hat{\mu}_i(\delta)} \; \hat{\mu}_i(\delta)^T [\nabla_a Q(s, a)]_{a=\mu_i}, \quad \text{with} \; \delta = \min_i \left\{ \tfrac{1}{2} \log \det(2\pi\Sigma_i) - \log \lambda_i \right\} + \log \tau,$$

where

$$\hat{\mu}_i(\delta) = \mu_i + \sqrt{\frac{2(\delta + \log \lambda_i) - \log \det(2\pi\Sigma_i)}{[\nabla_a Q(s, a)]_{a=\mu_i}^T \Sigma_i [\nabla_a Q(s, a)]_{a=\mu_i}}} \, \Sigma_i [\nabla_a Q(s, a)]_{a=\mu_i}.$$

Proof.
Recall that the Gaussian Mixture behavior policy is $\pi_\beta = \sum_{i=1}^N \lambda_i \mathcal{N}(\mu_i, \Sigma_i)$. We first divide the optimization problem (25) into $N$ sub-problems, with sub-problem $i$ given by

$$\max_\mu \; \mu^T [\nabla_a Q(s, a)]_{a=a_\beta}, \quad \text{s.t.} \; -\frac{1}{2} (\mu - \mu_i)^T \Sigma_i^{-1} (\mu - \mu_i) - \frac{1}{2} \log \det(2\pi\Sigma_i) + \log \lambda_i \ge -\delta,$$

which is equivalent to solving problem (5) for each Gaussian component with an additional constant term $\log \lambda_i$, and thus has a unique closed-form solution. Define the maximizer of sub-problem $i$ as $\hat{\mu}_i(\delta)$; note that $\hat{\mu}_i(\delta)$ does not always exist. Whenever $-\frac{1}{2} \log \det(2\pi\Sigma_i) + \log \lambda_i < -\delta$, no $\mu$ satisfies the constraint, as $\frac{1}{2} (\mu - \mu_i)^T \Sigma_i^{-1} (\mu - \mu_i)$ is always non-negative. We thus set $\hat{\mu}_i(\delta)$ to None in this case.

Next, we show that there does not exist any $\hat{\mu} \notin \{\hat{\mu}_i(\delta) \mid i = 1 \dots N\}$ such that $\hat{\mu}$ is the maximizer of (25). We show this by contradiction. Suppose there exists a maximizer $\hat{\mu} \notin \{\hat{\mu}_i(\delta) \mid i = 1 \dots N\}$ of (25). Then there exists at least one $j \in \{1, \dots, N\}$ such that

$$-\frac{1}{2} (\hat{\mu} - \mu_j)^T \Sigma_j^{-1} (\hat{\mu} - \mu_j) - \frac{1}{2} \log \det(2\pi\Sigma_j) + \log \lambda_j \ge -\delta.$$

Since $\hat{\mu}$ is the maximizer of (25), it must also be the maximizer of sub-problem $j$. However, the maximizer of sub-problem $j$ is $\hat{\mu}_j(\delta) \ne \hat{\mu}$, contradicting the fact that $\hat{\mu}$ is the maximizer of sub-problem $j$. Therefore, the optimal solution to (25) has to be given by

$$\arg\max_{\hat{\mu}_i} \; \hat{\mu}_i^T [\nabla_a Q(s, a)]_{a=a_\beta}, \quad \hat{\mu}_i \in \{\hat{\mu}_i(\delta) \mid i = 1 \dots N\}.$$

To solve each sub-problem $i$, it is natural to set $a_\beta = \mu_i$, which reformulates sub-problem $i$ as

$$\max_\mu \; \mu^T [\nabla_a Q(s, a)]_{a=\mu_i}, \quad \text{s.t.} \; \frac{1}{2} (\mu - \mu_i)^T \Sigma_i^{-1} (\mu - \mu_i) \le \delta - \frac{1}{2} \log \det(2\pi\Sigma_i) + \log \lambda_i. \tag{31}$$

Note that problem (31) is a QCLP similar to problem (5). Therefore, we can derive its solution by following similar procedures as in Appendix A.1, resulting in

$$\hat{\mu}_i(\delta) = \mu_i + \sqrt{\frac{2(\delta + \log \lambda_i) - \log \det(2\pi\Sigma_i)}{[\nabla_a Q(s, a)]_{a=\mu_i}^T \Sigma_i [\nabla_a Q(s, a)]_{a=\mu_i}}} \, \Sigma_i [\nabla_a Q(s, a)]_{a=\mu_i}.$$
We complete the proof by further setting $\delta = \min_i\left\{\tfrac{1}{2}\log\det(2\pi\Sigma_i) - \log\lambda_i\right\} + \log\tau$.

A.3 PROOF OF PROPOSITION 3.3

Proposition 3.3. By applying the second inequality of Lemma 3.1 to the constraint of (10), we can derive an optimization problem that lower bounds (10):
$$\max_\mu\ \mu^T[\nabla_a Q(s,a)]_{a=a_\beta}\quad \text{s.t.}\quad \sum_{i=1}^N \lambda_i\left(-\tfrac{1}{2}\log\det(2\pi\Sigma_i) - \tfrac{1}{2}(\mu-\mu_i)^T\Sigma_i^{-1}(\mu-\mu_i)\right) \ge -\delta, \tag{13}$$
and the closed-form solution to (13) is given by
$$\mu_{jensen}(\tau) = \bar\mu + \sqrt{\frac{2\log\tau - \sum_{i=1}^N \lambda_i\,\mu_i^T\Sigma_i^{-1}\mu_i + \bar\mu^T\bar\Sigma^{-1}\bar\mu}{[\nabla_a Q(s,a)]_{a=\bar\mu}^T\,\bar\Sigma\,[\nabla_a Q(s,a)]_{a=\bar\mu}}}\,\bar\Sigma\,[\nabla_a Q(s,a)]_{a=\bar\mu},$$
where
$$\bar\Sigma = \Big(\sum_{i=1}^N \lambda_i\Sigma_i^{-1}\Big)^{-1},\quad \bar\mu = \bar\Sigma\sum_{i=1}^N \lambda_i\Sigma_i^{-1}\mu_i,\quad \delta = \log\tau + \tfrac{1}{2}\sum_{i=1}^N \lambda_i\log\det(2\pi\Sigma_i).$$

Proof. Problem (33) is also a QCLP. Before deciding the value of $a_\beta$, we first derive its Lagrangian for a general $a_\beta$:
$$\mathcal{L} = \mu^T[\nabla_a Q(s,a)]_{a=a_\beta} - \eta\left(\sum_{i=1}^N \lambda_i\left(\tfrac{1}{2}\log\det(2\pi\Sigma_i) + \tfrac{1}{2}(\mu-\mu_i)^T\Sigma_i^{-1}(\mu-\mu_i)\right) - \delta\right).$$
Taking the derivative w.r.t. $\mu$ gives
$$\nabla_\mu\mathcal{L} = [\nabla_a Q(s,a)]_{a=a_\beta} - \eta\sum_{i=1}^N \lambda_i\Sigma_i^{-1}(\mu-\mu_i).$$
Setting $\nabla_\mu\mathcal{L} = 0$, we get
$$\mu = \Big(\sum_{i=1}^N \lambda_i\Sigma_i^{-1}\Big)^{-1}\left(\sum_{i=1}^N \lambda_i\Sigma_i^{-1}\mu_i + \tfrac{1}{\eta}[\nabla_a Q(s,a)]_{a=a_\beta}\right) = \bar\mu + \tfrac{1}{\eta}\,\bar\Sigma\,[\nabla_a Q(s,a)]_{a=a_\beta}, \tag{37}$$
where $\bar\Sigma = \big(\sum_{i=1}^N \lambda_i\Sigma_i^{-1}\big)^{-1}$ and $\bar\mu = \bar\Sigma\sum_{i=1}^N \lambda_i\Sigma_i^{-1}\mu_i$.

Equation (37) shows that the final solution to problem (33) is a shift from the pseudo-mean $\bar\mu$, so setting $a_\beta = \bar\mu$ is a natural choice. Furthermore, the KKT conditions require $\eta > 0$ and
$$\sum_{i=1}^N \lambda_i(\mu-\mu_i)^T\Sigma_i^{-1}(\mu-\mu_i) = 2\delta - \sum_{i=1}^N \lambda_i\log\det(2\pi\Sigma_i). \tag{38}$$
Writing $g = [\nabla_a Q(s,a)]_{a=\bar\mu}$ for brevity, plugging (37) into (38) gives
$$\sum_{i=1}^N \lambda_i\left(\bar\mu + \tfrac{1}{\eta}\bar\Sigma g - \mu_i\right)^T\Sigma_i^{-1}\left(\bar\mu + \tfrac{1}{\eta}\bar\Sigma g - \mu_i\right) = 2\delta - \sum_{i=1}^N \lambda_i\log\det(2\pi\Sigma_i). \tag{39}$$
The LHS of (39) can be expanded as
$$\tfrac{1}{\eta^2}\sum_{i=1}^N \lambda_i(\bar\Sigma g)^T\Sigma_i^{-1}(\bar\Sigma g) + \tfrac{2}{\eta}\sum_{i=1}^N \lambda_i(\bar\Sigma g)^T\Sigma_i^{-1}(\bar\mu - \mu_i) + \sum_{i=1}^N \lambda_i(\bar\mu-\mu_i)^T\Sigma_i^{-1}(\bar\mu-\mu_i). \tag{40}$$
We note that the cross term in (40) vanishes:
$$\tfrac{2}{\eta}\sum_{i=1}^N \lambda_i(\bar\Sigma g)^T\Sigma_i^{-1}(\bar\mu-\mu_i) = \tfrac{2}{\eta}(\bar\Sigma g)^T\Big(\sum_{i=1}^N \lambda_i\Sigma_i^{-1}\bar\mu - \sum_{i=1}^N \lambda_i\Sigma_i^{-1}\mu_i\Big) = \tfrac{2}{\eta}(\bar\Sigma g)^T\left(\bar\Sigma^{-1}\bar\mu - \bar\Sigma^{-1}\bar\mu\right) = 0. \tag{41}$$
Therefore, using $(\bar\Sigma g)^T\big(\sum_i \lambda_i\Sigma_i^{-1}\big)(\bar\Sigma g) = (\bar\Sigma g)^T\bar\Sigma^{-1}(\bar\Sigma g) = g^T\bar\Sigma g$, (40) further reduces to
$$\tfrac{1}{\eta^2}\,g^T\bar\Sigma g + \sum_{i=1}^N \lambda_i(\bar\mu-\mu_i)^T\Sigma_i^{-1}(\bar\mu-\mu_i), \tag{42}$$
so that (39) becomes
$$\tfrac{1}{\eta^2}\,g^T\bar\Sigma g + \sum_{i=1}^N \lambda_i(\bar\mu-\mu_i)^T\Sigma_i^{-1}(\bar\mu-\mu_i) = 2\delta - \sum_{i=1}^N \lambda_i\log\det(2\pi\Sigma_i). \tag{43}$$
We can thus express $\eta$ as
$$\eta = \sqrt{\frac{g^T\bar\Sigma g}{2\delta - \sum_{i=1}^N \lambda_i\log\det(2\pi\Sigma_i) - \sum_{i=1}^N \lambda_i(\bar\mu-\mu_i)^T\Sigma_i^{-1}(\bar\mu-\mu_i)}}. \tag{44}$$
Setting $\delta = \tfrac{1}{2}\sum_{i=1}^N \lambda_i\log\det(2\pi\Sigma_i) + \log\tau$, we have
$$\eta = \sqrt{\frac{g^T\bar\Sigma g}{2\log\tau - \sum_{i=1}^N \lambda_i(\bar\mu-\mu_i)^T\Sigma_i^{-1}(\bar\mu-\mu_i)}} = \sqrt{\frac{g^T\bar\Sigma g}{2\log\tau + \bar\mu^T\bar\Sigma^{-1}\bar\mu - \sum_{i=1}^N \lambda_i\,\mu_i^T\Sigma_i^{-1}\mu_i}}, \tag{45}$$
where the second equality uses the expansion $\sum_i \lambda_i(\bar\mu-\mu_i)^T\Sigma_i^{-1}(\bar\mu-\mu_i) = \bar\mu^T\bar\Sigma^{-1}\bar\mu - 2\bar\mu^T\bar\Sigma^{-1}\bar\mu + \sum_i \lambda_i\,\mu_i^T\Sigma_i^{-1}\mu_i = \sum_i \lambda_i\,\mu_i^T\Sigma_i^{-1}\mu_i - \bar\mu^T\bar\Sigma^{-1}\bar\mu$.

Finally, plugging (45) into (37) with $a_\beta = \bar\mu$, we obtain
$$\mu_{jensen}(\tau) = \bar\mu + \sqrt{\frac{2\log\tau - \sum_{i=1}^N \lambda_i\,\mu_i^T\Sigma_i^{-1}\mu_i + \bar\mu^T\bar\Sigma^{-1}\bar\mu}{g^T\bar\Sigma g}}\,\bar\Sigma g,$$
which completes the proof.

A.4 PROOF OF THEOREM 3.1

In this section, we prove the safe policy improvement guarantee presented in Section 3.3.
Algorithm 1 follows approximate policy iteration (API) (Perkins & Precup, 2002) by iterating between policy evaluation (E step, Line 4) and policy improvement (I step, Line 5). Therefore, to verify that the improvement step indeed improves, we first need to show that the policy evaluation $\hat Q_t$ is accurate. In particular, we focus on the SARSA updates (Line 1), a form of on-policy Fitted Q-Iteration (FQI) (Sutton & Barto, 2018). Fortunately, FQI is known to be statistically efficient (e.g., Chen & Jiang, 2019) under mild conditions on the function approximation class. Its linear counterpart, least-squares value iteration, has also been shown to be efficient for offline reinforcement learning (Jin et al., 2021; Yin et al., 2022). Recently, Zou et al. (2019) showed a finite-sample convergence guarantee for SARSA under the standard mean squared error loss. Next, we leverage the performance difference lemma to show that our algorithm achieves the desired improvement.

Lemma A.1 (Performance Difference Lemma). For any policies $\pi, \pi'$, it holds that
$$J(\pi) - J(\pi') = \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d^\pi}\mathbb{E}_{a\sim\pi(\cdot|s)}\left[A^{\pi'}(s,a)\right],$$
where $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$ is the advantage function.

Similar to Kumar et al. (2020b), we focus on the discrete case where the numbers of states $|S|$ and actions $|A|$ are finite (note that in the continuous case, $D(s,a)$ would be 0 for most locations, making the bound less interesting). The adaptation to continuous spaces can leverage standard techniques such as state abstraction (Li et al., 2006) and covering arguments. Next, we define the learning coefficient $C_{\gamma,\delta}$ of SARSA via
$$\left|\hat Q^{\hat\pi_\beta}(s,a) - Q^{\hat\pi_\beta}(s,a)\right| \le \frac{C_{\gamma,\delta}}{\sqrt{D(s,a)}},\quad \forall (s,a) \in S\times A.$$
Define the first-order approximation
$$\bar Q^{\hat\pi_\beta}(s,a) := (a - a_\beta)^T\left[\nabla_a \hat Q^{\hat\pi_\beta}(s,a)\right]_{a=a_\beta} + \hat Q^{\hat\pi_\beta}(s,a_\beta);$$
then the approximation error is defined as
$$C_{CFPI}(s,a) := \left|\bar Q^{\hat\pi_\beta}(s,a) - \hat Q^{\hat\pi_\beta}(s,a)\right| = \left|(a-a_\beta)^T\left[\nabla_a \hat Q^{\hat\pi_\beta}(s,a)\right]_{a=a_\beta} + \hat Q^{\hat\pi_\beta}(s,a_\beta) - \hat Q^{\hat\pi_\beta}(s,a)\right|.$$
Under the constraint $D(\pi(\cdot|s), \hat\pi_\beta(\cdot|s)) \le \delta$ in (4) (equivalently, action $a$ being close to $a_\beta$), the first-order approximation provides a good estimate of $\hat Q^{\hat\pi_\beta}$.

Theorem A.1 (Restatement of Theorem 3.1). Assume the state and action spaces are discrete. Let $\hat\pi_1$ be the policy obtained after the CFPI update (Line 2 of Algorithm 1). Then with probability $1-\delta$,
$$J(\hat\pi_1) - J(\pi_\beta) \ge \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d^{\hat\pi_1}}\left[\bar Q^{\hat\pi_\beta}(s,\hat\pi_1(s)) - \bar Q^{\hat\pi_\beta}(s,\hat\pi_\beta(s))\right] - \frac{2}{1-\gamma}\,\mathbb{E}_{s\sim d^{\hat\pi_1}}\mathbb{E}_{a\sim\hat\pi_1(\cdot|s)}\left[\frac{C_{\gamma,\delta}}{\sqrt{D(s,a)}} + C_{CFPI}(s,a)\right] =: \zeta.$$
For the multi-step case with $T$ iterative updates, we similarly have, with probability $1-\delta$,
$$J(\hat\pi_T) - J(\pi_\beta) = \sum_{t=1}^T\left[J(\hat\pi_t) - J(\hat\pi_{t-1})\right] \ge \sum_{t=1}^T \zeta^{(t)},$$
where $D(s,a)$ denotes the number of samples at $(s,a)$, the learning coefficient of SARSA is defined as
$$C_{\gamma,\delta} = \max_{s_0,a_0}\sqrt{2\ln(12SA/\delta)}\cdot\sqrt{\sum_{h=0}^\infty\sum_{s,a}\gamma^{2h}\,\mu_h^{\hat\pi_\beta}(s,a|s_0,a_0)^2\,\mathrm{Var}\left[V^{\hat\pi_\beta}(s')\,|\,s,a\right]},$$
and $C_{CFPI}(s,a)$ denotes the error of the first-order approximation (3), (4) used by CFPI, i.e.,
$$C_{CFPI}(s,a) := \left|(a-a_\beta)^T\left[\nabla_a \hat Q^{\hat\pi_\beta}(s,a)\right]_{a=a_\beta} + \hat Q^{\hat\pi_\beta}(s,a_\beta) - \hat Q^{\hat\pi_\beta}(s,a)\right|.$$
When $a = a_\beta$, $C_{CFPI}(s,a) = 0$.

Proof of Theorem 3.1. We focus on the first update, from $\hat\pi_\beta$ to $\hat\pi_1$. By the SARSA update, we have
$$\left|\hat Q^{\hat\pi_\beta}(s,a) - Q^{\hat\pi_\beta}(s,a)\right| \le \frac{C_{\gamma,\delta}}{\sqrt{D(s,a)}},\quad \forall (s,a) \in S\times A$$
with probability $1-\delta$, due to existing on-policy evaluation results (e.g., Zou et al., 2019). Also denote $\hat\pi_1 := \arg\max_\pi \bar Q^{\hat\pi_\beta}$. By Lemma A.1,
$$\begin{aligned}
J(\hat\pi_1) - J(\pi_\beta) &= \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d^{\hat\pi_1}}\mathbb{E}_{a\sim\hat\pi_1(\cdot|s)}\left[A^{\hat\pi_\beta}(s,a)\right]\\
&= \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d^{\hat\pi_1}}\mathbb{E}_{a\sim\hat\pi_1(\cdot|s)}\left[Q^{\hat\pi_\beta}(s,a) - V^{\hat\pi_\beta}(s)\right]\\
&= \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d^{\hat\pi_1}}\mathbb{E}_{a\sim\hat\pi_1(\cdot|s)}\left[Q^{\hat\pi_\beta}(s,a) - Q^{\hat\pi_\beta}(s,\hat\pi_\beta(s))\right]\\
&\ge \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d^{\hat\pi_1}}\mathbb{E}_{a\sim\hat\pi_1(\cdot|s)}\left[\hat Q^{\hat\pi_\beta}(s,a) - \hat Q^{\hat\pi_\beta}(s,\hat\pi_\beta(s))\right] - \frac{2}{1-\gamma}\,\mathbb{E}_{s\sim d^{\hat\pi_1}}\mathbb{E}_{a\sim\hat\pi_1(\cdot|s)}\left[\frac{C_{\gamma,\delta}}{\sqrt{D(s,a)}}\right]\\
&\ge \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d^{\hat\pi_1}}\left[\bar Q^{\hat\pi_\beta}(s,\hat\pi_1(s)) - \bar Q^{\hat\pi_\beta}(s,\hat\pi_\beta(s))\right] - \frac{2}{1-\gamma}\,\mathbb{E}_{s\sim d^{\hat\pi_1}}\mathbb{E}_{a\sim\hat\pi_1(\cdot|s)}\left[\frac{C_{\gamma,\delta}}{\sqrt{D(s,a)}} + C_{CFPI}(s,a)\right] =: \zeta,
\end{aligned}$$
where the first inequality uses $|\hat Q^{\hat\pi_\beta}(s,a) - Q^{\hat\pi_\beta}(s,a)| \le C_{\gamma,\delta}/\sqrt{D(s,a)}$ and the last inequality uses $\hat\pi_1 := \arg\max_\pi \bar Q^{\hat\pi_\beta}$.
Here
$$C_{\gamma,\delta} = \max_{s_0,a_0}\sqrt{2\ln(12SA/\delta)}\cdot\sqrt{\sum_{h=0}^\infty\sum_{s,a}\gamma^{2h}\,\mu_h^{\hat\pi_\beta}(s,a|s_0,a_0)^2\,\mathrm{Var}\left[V^{\hat\pi_\beta}(s')\,|\,s,a\right]}.$$
Similarly, for iterations $t > 1$, denote
$$C^{(t)}_{\gamma,\delta} := \max_{s_0,a_0}\sqrt{2\ln(12SA/\delta)}\cdot\sqrt{\sum_{h=0}^\infty\sum_{s,a}\gamma^{2h}\,\frac{\mu_h^{\hat\pi_t}(s,a|s_0,a_0)^2}{\mu_h^{\hat\pi_{t-1}}(s,a|s_0,a_0)}\,\mathrm{Var}\left[V^{\hat\pi_t}(s')\,|\,s,a\right]}.$$
Then with probability $1-\delta$, by Corollary 1 of Duan et al. (2020), the OPE estimate satisfies
$$\left|\hat Q^{\hat\pi_{t-1}}(s,a) - Q^{\hat\pi_{t-1}}(s,a)\right| \le \frac{C^{(t)}_{\gamma,\delta}}{\sqrt{D(s,a)}}$$
and
$$J(\hat\pi_t) - J(\hat\pi_{t-1}) \ge \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d^{\hat\pi_t}}\left[\bar Q^{\hat\pi_{t-1}}(s,\hat\pi_t(s)) - \bar Q^{\hat\pi_{t-1}}(s,\hat\pi_{t-1}(s))\right] - \frac{2}{1-\gamma}\,\mathbb{E}_{s\sim d^{\hat\pi_t}}\mathbb{E}_{a\sim\hat\pi_t(\cdot|s)}\left[\frac{C^{(t)}_{\gamma,\delta}}{\sqrt{D(s,a)}} + C_{CFPI}(s,a)\right] =: \zeta^{(t)}.$$
Hence, for the multi-step iterative algorithm, a union bound gives, with probability $1-\delta$,
$$J(\hat\pi_T) - J(\pi_\beta) = \sum_{t=1}^T\left[J(\hat\pi_t) - J(\hat\pi_{t-1})\right] \ge \sum_{t=1}^T \zeta^{(t)}.$$

On the learning coefficient of SARSA. The learning of SARSA is known to be statistically efficient from the existing off-policy evaluation (OPE) literature, e.g., (Duan et al., 2020; Yin & Wang, 2020), since the on-policy SARSA scheme is just a special case of the OPE task with $\pi = \hat\pi_\beta$. Concretely, we can translate the finite-sample error bound in Corollary 1 of (Duan et al., 2020) to the infinite-horizon discounted setting as: for any initial state-action pair $(s_0, a_0)$, with probability $1-\delta$,
$$\left|\hat Q^{\hat\pi_\beta}(s_0,a_0) - Q^{\hat\pi_\beta}(s_0,a_0)\right| \le \frac{1}{\sqrt{D(s_0,a_0)}}\sqrt{2\ln(12/\delta)}\cdot\sqrt{\sum_{h=0}^\infty\sum_{s,a}\gamma^{2h}\,\mu_h^{\hat\pi_\beta}(s,a|s_0,a_0)^2\,\mathrm{Var}\left[V^{\hat\pi_\beta}(s')\,|\,s,a\right]}.$$
Note that the original statement in (Duan et al., 2020) is for $v^{\hat\pi_\beta} - \hat v^{\hat\pi_\beta}$; here we state the version for $\hat Q^{\hat\pi_\beta} - Q^{\hat\pi_\beta}$ instead, which can be readily obtained by fixing the initial state-action pair $(s_0, a_0)$ in $v^\pi$. As a result, by a union bound over $S\times A$, it is valid to define
$$C_{\gamma,\delta} = \max_{s_0,a_0}\sqrt{2\ln(12SA/\delta)}\cdot\sqrt{\sum_{h=0}^\infty\sum_{s,a}\gamma^{2h}\,\mu_h^{\hat\pi_\beta}(s,a|s_0,a_0)^2\,\mathrm{Var}\left[V^{\hat\pi_\beta}(s')\,|\,s,a\right]},$$
which ensures the statistical guarantee in Theorem 3.1 goes through.
Similarly, for the multi-step case, the OPE estimator holds with the corresponding coefficient
$$C^{(t)}_{\gamma,\delta} := \max_{s_0,a_0}\sqrt{2\ln(12SA/\delta)}\cdot\sqrt{\sum_{h=0}^\infty\sum_{s,a}\gamma^{2h}\,\frac{\mu_h^{\hat\pi_t}(s,a|s_0,a_0)^2}{\mu_h^{\hat\pi_{t-1}}(s,a|s_0,a_0)}\,\mathrm{Var}\left[V^{\hat\pi_t}(s')\,|\,s,a\right]}.$$
Lastly, even the assumption that the state-action space is finite is not essential for Theorem 3.1: for more general function approximation, recent OPE literature (Zhang et al., 2022) shows that the SARSA update in Algorithm 1 remains statistically efficient.
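The single-Gaussian update of Proposition 3.1 is cheap to evaluate in practice. The following NumPy sketch (our own illustration; the function and variable names are not from the paper's code) computes $\mu_{sg}(\tau)$ from the behavior mean, covariance, and the Q-gradient at the mean:

```python
import numpy as np

def mu_sg(mu_beta, Sigma_beta, grad_q, log_tau):
    """Closed-form single-Gaussian CFPI update (Proposition 3.1), a sketch.

    mu_beta:    behavior-policy mean, shape (d,)
    Sigma_beta: behavior-policy covariance, shape (d, d)
    grad_q:     [nabla_a Q(s, a)] evaluated at a = mu_beta, shape (d,)
    log_tau:    the hyperparameter log(tau) > 0
    """
    direction = Sigma_beta @ grad_q
    # Step size sqrt(2 log tau / (g^T Sigma g)) obtained from the KKT conditions.
    scale = np.sqrt(2.0 * log_tau / (grad_q @ direction))
    return mu_beta + scale * direction
```

With an isotropic covariance the update simply moves $\sqrt{2\log\tau}$ standard deviations along the Q-gradient, which matches the intuition that larger $\tau$ permits a larger deviation from $\mu_\beta$.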

B DETAILED PROCEDURES TO OBTAIN EQUATION 15

We first highlight that we set the hyperparameter $\delta$ differently for Propositions 3.2 and 3.3. With the same $\tau$, we generate two different $\delta$ for the two settings:
$$\delta_{lse}(\tau) = \log\tau + \min_i\left\{\tfrac{1}{2}\log\det(2\pi\Sigma_i) - \log\lambda_i\right\}\quad \text{(Proposition 3.2)},$$
$$\delta_{jensen}(\tau) = \log\tau + \tfrac{1}{2}\sum_{i=1}^N \lambda_i\log\det(2\pi\Sigma_i)\quad \text{(Proposition 3.3)}. \tag{47}$$

We next provide intuition for the design choices in (47). Recall that the Gaussian Mixture behavior policy is constructed as $\pi_\beta = \sum_{i=1}^N \lambda_i\,\mathcal{N}(\mu_i, \Sigma_i)$. With the mixture weights $\lambda_{i=1\dots N}$, we define the scaled probability $\tilde\pi_i(a)$ of the $i$-th Gaussian component evaluated at $a$:
$$\tilde\pi_i(a) = \lambda_i\,\pi_i(a) = \lambda_i\det(2\pi\Sigma_i)^{-\frac{1}{2}}\exp\left\{-\tfrac{1}{2}(a-\mu_i)^T\Sigma_i^{-1}(a-\mu_i)\right\},$$
where $\pi_i(a) = \mathcal{N}(a;\mu_i,\Sigma_i)$ denotes the probability of the $i$-th Gaussian component evaluated at $a$. Therefore,
$$\log\tilde\pi_i(\mu_i) = \log\lambda_i - \tfrac{1}{2}\log\det(2\pi\Sigma_i),$$
which implies
$$\delta_{lse}(\tau) = \log\tau + \min_i\left\{\tfrac{1}{2}\log\det(2\pi\Sigma_i) - \log\lambda_i\right\} = -\max_i\left\{\log\lambda_i - \tfrac{1}{2}\log\det(2\pi\Sigma_i) - \log\tau\right\} = -\max_i\log\left(\tfrac{1}{\tau}\tilde\pi_i(\mu_i)\right).$$
By setting $\delta_{lse}(\tau)$ in this way, $\tilde\mu_j = \tilde\mu_j(\delta_{lse}(\tau))$ satisfies the following condition whenever it is a valid solution to sub-problem $j$ (28), due to the KKT conditions, for all $j \in \{1,\dots,N\}$:
$$-\tfrac{1}{2}(\tilde\mu_j-\mu_j)^T\Sigma_j^{-1}(\tilde\mu_j-\mu_j) - \tfrac{1}{2}\log\det(2\pi\Sigma_j) + \log\lambda_j = -\delta_{lse}(\tau)$$
$$\iff \log\tilde\pi_j(\tilde\mu_j) = \max_i\log\left(\tfrac{1}{\tau}\tilde\pi_i(\mu_i)\right) \iff \tilde\pi_j(\tilde\mu_j) = \tfrac{1}{\tau}\max_i\{\tilde\pi_i(\mu_i)\}.$$

To elaborate on the design of $\delta_{jensen}(\tau)$, we first recall that the constraint of problem (13) is
$$\sum_{i=1}^N \lambda_i\left(-\tfrac{1}{2}\log\det(2\pi\Sigma_i) - \tfrac{1}{2}(\mu-\mu_i)^T\Sigma_i^{-1}(\mu-\mu_i)\right) \ge -\delta_{jensen}(\tau). \tag{52}$$
Note that the LHS of (52) is a concave function of $\mu$. Thus we can obtain its maximum by setting its derivative to zero:
$$\nabla_\mu\sum_{i=1}^N \lambda_i\left(-\tfrac{1}{2}\log\det(2\pi\Sigma_i) - \tfrac{1}{2}(\mu-\mu_i)^T\Sigma_i^{-1}(\mu-\mu_i)\right) = -\sum_{i=1}^N \lambda_i\Sigma_i^{-1}(\mu-\mu_i) = -\bar\Sigma^{-1}\mu + \bar\Sigma^{-1}\bar\mu. \tag{53}$$
Interestingly, the maximizer is exactly $\mu = \bar\mu$.
Plugging $\mu = \bar\mu$ into the LHS of (52), we obtain its maximum:
$$-\tfrac{1}{2}\sum_{i=1}^N \lambda_i\log\det(2\pi\Sigma_i) - \tfrac{1}{2}\sum_{i=1}^N \lambda_i(\bar\mu-\mu_i)^T\Sigma_i^{-1}(\bar\mu-\mu_i) \le -\tfrac{1}{2}\sum_{i=1}^N \lambda_i\log\det(2\pi\Sigma_i) = \sum_{i=1}^N \lambda_i\log\pi_i(\mu_i).$$
The inequality holds because each covariance matrix $\Sigma_i$, $i \in \{1,\dots,N\}$, is positive semi-definite. Therefore, our choice of $\delta_{jensen}(\tau)$ can be interpreted as
$$\delta_{jensen}(\tau) = \log\tau + \tfrac{1}{2}\sum_{i=1}^N \lambda_i\log\det(2\pi\Sigma_i) = -\Big(\sum_{i=1}^N \lambda_i\log\pi_i(\mu_i) - \log\tau\Big). \tag{55}$$

Algorithm 2 Iterative I_MG
1: Input: Learned behavior policy $\hat\pi_\beta$, Q-network parameters $\phi_1, \phi_2$, target Q-network parameters $\phi_{targ,1}, \phi_{targ,2}$, dataset $D$, parameter $\tau$
2: repeat
3:   Randomly sample a batch of transitions $B = \{(s,a,r,s',d)\}$ from $D$
4:   Compute target actions $a'(s') = \mathrm{clip}\big(I_{MG}(\hat\pi_\beta, Q; \tau)(s') + \mathrm{clip}(\epsilon, -c, c),\ a_{Low},\ a_{High}\big)$, where $Q = \min(Q_{\phi_1}, Q_{\phi_2})$ and $\epsilon \sim \mathcal{N}(0,\sigma)$
5:   Compute targets $y(r,s',d) = r + \gamma(1-d)\min_{i=1,2} Q_{\phi_{targ,i}}(s', a'(s'))$
6:   Update Q-functions by one step of gradient descent using $\nabla_{\phi_i}\frac{1}{|B|}\sum_{(s,a,r,s',d)\in B}\big(Q_{\phi_i}(s,a) - y(r,s',d)\big)^2$ for $i = 1, 2$
7:   Update target networks with $\phi_{targ,i} \leftarrow \rho\,\phi_{targ,i} + (1-\rho)\phi_i$ for $i = 1, 2$
8: until convergence
9: Output: $I_{MG}(\hat\pi_\beta, Q; \tau)$

C MULTI-STEP AND ITERATIVE ALGORITHMS

By setting $T > 0$, we can derive multi-step and iterative algorithms. Thanks to the tractability of our CFPI operators $I_{SG}$ and $I_{MG}$, we can always perform the policy improvement step in closed form. Therefore, there is no significant gap between multi-step and iterative algorithms with our CFPI operators. One can differentiate our multi-step and iterative algorithms by whether the algorithm trains the policy evaluation step $E(\hat Q_{t-1}, \hat\pi_t, D)$ to convergence or not.
As for the policy evaluation operator E, fitted Q evaluation (Ernst et al., 2005; Le et al., 2019; Fujimoto et al., 2022) with a target network (Mnih et al., 2015) has been demonstrated to be an effective and successful paradigm for policy evaluation (Kumar et al., 2019; Fujimoto & Gu, 2021; Haarnoja et al., 2018; Lillicrap et al., 2015; Fujimoto et al., 2018) in deep (offline) RL. When instantiating a multi-step or iterative algorithm from Algorithm 1, one can also consider other policy evaluation operators by incorporating additional optimization techniques. In the rest of this section, we instantiate an iterative algorithm with our CFPI operators performing the policy improvement step and evaluate its effectiveness on the challenging AntMaze domains.
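Lines 4-5 of Algorithm 2 follow the standard clipped double-Q recipe with target-policy smoothing. A minimal NumPy sketch of the target computation (our own illustration; the improved policy's action is passed in as `policy_action`, and `q_targ1`/`q_targ2` stand in for the target networks):

```python
import numpy as np

def td_target(policy_action, q_targ1, q_targ2, r, s_next, done, rng,
              gamma=0.99, sigma=0.2, c=0.5, a_low=-1.0, a_high=1.0):
    """Targets of Algorithm 2 (Lines 4-5): smoothed action, clipped double-Q."""
    # TD3-style target-policy smoothing: clipped Gaussian noise on a' (Line 4).
    eps = np.clip(sigma * rng.standard_normal(policy_action.shape), -c, c)
    a_next = np.clip(policy_action + eps, a_low, a_high)
    # Clipped double-Q target (Line 5).
    q_next = np.minimum(q_targ1(s_next, a_next), q_targ2(s_next, a_next))
    return r + gamma * (1.0 - done) * q_next
```

The regression loss of Line 6 is then the mean squared error between each `Q_phi_i(s, a)` and this target; in a deep-RL implementation the same computation would be carried out on batched tensors with gradients stopped through the target.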

C.1 ITERATIVE ALGORITHM WITH OUR CFPI OPERATORS

In Sec. 5.1, we instantiate an iterative algorithm, Iterative I MG, with our CFPI operator I MG. Algorithm 2 presents the corresponding pseudo-code, which learns a set of Q-function networks for simplicity. Without loss of generality, we can easily generalize the algorithm to learn the action-value distribution Z(s, a) as defined in (58). For each task, we learn a Gaussian Mixture behavior policy πβ with behavior cloning. Similar to Sec. 5.1, we employ the IQN (Dabney et al., 2018a) architecture to model the Q-value network for its better generalizability. As our CFPI operator I MG returns a deterministic policy, we follow TD3 (Fujimoto et al., 2018) in performing policy smoothing by adding noise to the action a′(s′) in Line 4. After convergence, Algorithm 2 outputs an improved policy I MG(π β, Q; τ).

Table 5 compares our Iterative I MG with SOTA algorithms on the AntMaze domain. The performance of all baseline methods is reported directly from the IQL paper (Kostrikov et al., 2021).

Table 5: Comparison between our iterative algorithm and SOTA methods on the AntMaze domain of D4RL. We report the mean and standard deviation across 5 seeds for our methods. Our Iterative I MG outperforms all baselines on 5 out of 6 tasks and obtains the best overall performance, demonstrating the effectiveness of our CFPI operator when instantiating an iterative algorithm. "u" stands for umaze, "m" for medium, "l" for large, "p" for play, and "d" for diverse.

Our method outperforms all baseline methods on 5 out of 6 tasks and obtains the best overall performance.


The training curves are shown in Fig. 3, with the HP settings detailed in Table 6. We did not perform much HP tuning, so one should expect a performance improvement after fine-grained HP tuning. Algorithm 3 details the procedure for performing the policy improvement step with a stochastic behavior policy π β. We first obtain its EBCQ policy π EBCQ in Lines 1-2. As π EBCQ is deterministic, we then plug it into I DET in Line 3 to return an improved policy.

D.3 EXPERIMENT RESULTS

To evaluate the performance of I DET, we first learn two behavior policies with two different models. Specifically, we model π det with a three-layer MLP that outputs a deterministic policy and π vae with the variational auto-encoder (VAE) (Kingma & Welling, 2013) from BCQ (Fujimoto et al., 2019). Moreover, we reuse the same value function Q0 as in Section 5.1. We present the results in Table 7. DET-BC and VAE denote the performance of π det and π vae, respectively. VAE-EBCQ denotes the EBCQ performance of π vae with M = 50 candidate actions. Since π det is deterministic, its EBCQ performance is the same as DET-BC. For our two methods, we set δ = 0.1 for all datasets. We observe that both I DET with π det and I DET with π vae largely improve over the baseline methods. Moreover, I DET with π vae outperforms VAE-EBCQ by a significant margin on all three datasets, demonstrating the effectiveness of our CFPI operator. Indeed, our method benefits from an accurate and expressive behavior policy, as I DET with π vae achieves higher average performance than I DET with π det while maintaining a lower standard deviation on all three datasets. We also note that we did not spend much effort optimizing the HP, e.g., the VAE architecture, learning rates, and the value of τ.

E RELIABLE EVALUATION TO ADDRESS THE STATISTICAL UNCERTAINTY

Figure 4 : Comparison between our methods and baselines using reliable evaluation methods proposed in (Agarwal et al., 2021) . We re-examine the results in Table 4 on the 9 tasks from the D4RL MuJoCo Gym domain. Each metric is calculated with a 95% CI bootstrap based on 9 tasks and 10 seeds for each task. Each seed further evaluates each method for 100 episodes. The interquartile mean (IQM) discards the top and bottom 25% data points and calculates the mean across the remaining 50% runs. The IQM is more robust as an estimator to outliers than the mean while maintaining less variance than the median. Higher median, IQM, mean scores, and lower Optimality Gap correspond to better performance. Our I MG outperforms the baseline methods by a significant margin based on all four metrics. Figure 5 : Performance profiles (score distributions) for all methods on the 9 tasks from the D4RL MuJoCo Gym domain. The average score is calculated by averaging all runs within one task. Each task contains 10 seeds, and each seed evaluates for 100 episodes. Shaded area denotes 95% confidence bands based on percentile bootstrap and stratified sampling (Agarwal et al., 2021) . The η value where the curves intersect with the dashed horizontal line y = 0.5 corresponds to the median, while the area under the performance curves corresponds to the mean. To demonstrate the superiority of our methods over the baselines and provide reliable evaluation results, we follow the evaluation protocols proposed in (Agarwal et al., 2021) to re-examine the results in Table 4 . Specifically, we adopt the evaluation methods for all methods with N tasks × N seeds runs in total. Moreover, we obtain the performance profile of each method, revealing its score distribution and variability. 
In particular, the score distribution shows the fraction of runs above a certain threshold $\eta$ and is given by
$$F(\eta) = F(\eta;\, x_{1:N_{tasks},\,1:N_{seeds}}) = \frac{1}{N_{tasks}}\sum_{m=1}^{N_{tasks}}\frac{1}{N_{seeds}}\sum_{n=1}^{N_{seeds}}\mathbb{1}\left[x_{m,n} \ge \eta\right].$$
The evaluation results in Fig. 4 and Fig. 5 demonstrate that our I MG outperforms the baseline methods by a significant margin on all four reliable metrics.
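For concreteness, the score distribution and the interquartile mean can be computed in a few lines of NumPy. This is a sketch of the metric definitions above (not the implementation of the rliable library used in the paper):

```python
import numpy as np

def score_distribution(scores, eta):
    """F(eta): fraction of the n_tasks x n_seeds runs scoring at least eta."""
    return np.mean(scores >= eta)

def iqm(scores):
    """Interquartile mean: discard the top and bottom 25% of runs, average the rest."""
    flat = np.sort(scores, axis=None)
    n = flat.size
    return flat[n // 4 : n - n // 4].mean()
```

In practice one would additionally bootstrap these statistics with stratified sampling over tasks to obtain the 95% confidence intervals reported in Fig. 4 and Fig. 5.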

F HYPER-PARAMETER SETTINGS AND TRAINING DETAILS

For all methods we propose in Table 1, Table 3, and Table 4, we report the mean and standard deviation of each method across 10 seeds. Each seed involves an individual training process and evaluates the policy for 100 episodes.

F.1 HP AND TRAINING DETAILS FOR METHODS IN TABLE 1 AND TABLE 4

Table 8 includes the HP of methods evaluated on the Gym-MuJoCo domain. We use the Adam (Kingma & Ba, 2014) optimizer for all learning algorithms and normalize the states in each dataset following the practice of TD3+BC (Fujimoto & Gu, 2021). Note that our one-step offline RL algorithms presented in Table 1 (Our I MG) and Table 4 (I MG, I SG, MG-EBCQ, SG-EBCQ, MG-MS) require learning a behavior policy and the value function Q0. Therefore, we first describe the detailed procedures for learning Single Gaussian (SG-BC) and Gaussian Mixture (MG-BC) behavior policies. We next describe our SARSA-style training procedure for estimating Q0. Finally, we present the details of each one-step algorithm.

Table 13: Hyperparameters for methods in Table 3. For I SG(π IQL, Q IQL), log τ is selected from {0.1, 0.2, 2.0}.

Table 13 includes the HP for the experiments in Sec. 5.2. We use the same HP for IQL training as reported in the IQL paper. We obtain the IQL policy π IQL and Q IQL by training for 1M gradient steps using the PyTorch implementation from RLkit (Berkeley), a widely used RL library. We emphasize that we follow the authors' exact training and evaluation protocol. We include the training curves for all tasks from the AntMaze domain in Appendix G.6. Note that IQL (Kostrikov et al., 2021) reported inconsistent offline experiment results on AntMaze in its paper's Table 1, Table 2, Table 5, and Table 6. We suspect that these results were obtained from different sets of random seeds. In Appendix G.6, we present all these results in Table 17. To obtain the performance of I SG(π IQL, Q IQL), we follow the practice of (Brandfonbrener et al., 2021; Fu et al., 2020) and perform a grid search over log τ ∈ {0.1, 0.2, 2.0} using 3 seeds for each dataset. We then evaluate the best choice for each dataset on the other 7 seeds and report results over 10 seeds in total.

G ADDITIONAL EXPERIMENTS G.1 COMPLETE EXPERIMENT RESULTS FOR MG-MS

Table 14 provides the results of MG-MS on the 9 tasks from the MuJoCo Gym domain, complementing the results in Sec. 5.3. Fig. 6 presents additional results complementing Sec. 5.3. We note that Hopper-Medium-Expert-v2 requires a much smaller log τ than the other tasks to perform well.

Figure 6: Performance of I MG with varying log τ. The other HP can be found in Table 8. Each variant averages returns over 10 seeds, and each seed contains 100 evaluation episodes. The shaded area denotes the bootstrapped 95% CI.

G.3 ABLATION STUDY ON THE NUMBER OF GAUSSIAN COMPONENTS

In this section, we explore whether increasing the number of Gaussian components results in a performance boost. We use the same settings as in Table 1, except that we model πβ with 8 Gaussian components instead of 4. We hypothesized that a performance gain would most likely occur on the three Medium-Replay datasets, as these datasets are collected by diverse policies. However, Table 15 shows that simply increasing the number of Gaussian components from 4 to 8 hardly results in a performance boost, as more Gaussian components induce extra optimization difficulties during behavior cloning (Jin et al., 2016).

Fig. 9 presents the validation loss L val on each dataset from the Gym-MuJoCo domain. We clearly observe that Qβ generally overfits D train when trained for too many gradient steps.

G.7 IMPROVE THE POLICY LEARNED BY CQL

In this section, we show that our CFPI operators can also improve the policy learned by CQL (Kumar et al., 2020b) on the MuJoCo Gym domain. We first obtain the CQL policy π CQL and Q CQL by training for 1M gradient steps using the official CQL implementation. We then obtain an improved policy I SG(π CQL, Q CQL; τ) that slightly outperforms π CQL overall, as shown in Table 18. For all 6 tasks, we set log τ = 0.1.

H ADDITIONAL RELATED WORK

There is a long history of leveraging the Taylor expansion to construct efficient RL algorithms. Kakade & Langford (2002) proposed conservative policy iteration, which optimizes a mixture of policies towards a lower bound of the policy objective constructed by a first-order Taylor expansion in the mixture coefficient. Later, the SOTA deep RL algorithms TRPO (Schulman et al., 2015) and PPO (Schulman et al., 2017) extended these results to trust-region policy constraints and learn a stochastic policy parameterized by a neural network. More recently, Tang et al. (2020) developed a second-order Taylor expansion approach in similar online RL settings. At a high level, both our work and these previous methods create a surrogate of the original policy objective via the Taylor expansion. However, our motivation for using the Taylor expansion is fundamentally different from the previous works (Kakade & Langford, 2002; Schulman et al., 2015; 2017; Tang et al., 2020), which leverage it to construct a lower bound of the policy objective so that optimizing the lower bound translates into guaranteed policy improvement. These methods do not yield a closed-form solution for the policy and still require iterative policy updates. In contrast, our method leverages the Taylor expansion to construct a linear approximation of the policy objective, enabling a closed-form solution to the policy improvement step and thus avoiding policy improvement via SGD. We highlight that our closed-form policy update would not be possible without directly optimizing the parameter of the policy distribution; in particular, the parameter must belong to the action space. This is a significant conceptual difference between our method and previous works. Specifically, Kakade & Langford (2002) parameterize the mixture coefficient of a mixture policy as θ.
TRPO (Schulman et al., 2015) and PPO (Schulman et al., 2017) set θ as the parameters of a neural network that outputs the parameters of a Gaussian distribution. In contrast, our methods learn a deterministic policy π(s) = Dirac(θ(s)) and directly optimize the parameter θ(s). We aim to learn a greedy π by solving θ(s) = arg max_a Q(s, a). However, obtaining a greedy π in continuous control is problematic (Silver et al., 2014). Given the requirement of a limited distributional shift in offline RL, we leverage the first-order Taylor expansion to relax the problem into the more tractable form $\max_a \bar Q(s, a; a_\beta)$, s.t. $-\log\pi_\beta(a|s) \le \delta$, where $\bar Q$ is defined in Equation 3. By modeling π β as a Single Gaussian or Gaussian Mixture, we further transform the problem into a QCLP and derive the closed-form solution. Finally, we note that both the trust-region methods TRPO and PPO and our methods constrain the divergence between the learned policy and a behavior policy. However, the behavior policy always remains unchanged in our offline RL setting, whereas in the online RL tasks targeted by TRPO and PPO, the updated policy collects new data and becomes the new behavior policy in future training iterations.



Footnotes:
1. Note that here we assume discreteness only for the purpose of analysis. For more general cases, please refer to Appendix A.4.
2. In Theorem 3.1, $\mu_h^\pi(s,a|s_0,a_0) := P^\pi(s_h = s, a_h = a \,|\, s_0, a_0)$.
3. IQL's Tables 5 & 6 are presented in the supplementary material of the IQL paper: https://github.com/ikostrikov/implicit_q_learning
4. Official CQL implementation: https://github.com/aviralkumar2907/CQL



Figure 1: Apply Lemma 3.1 to Gaussian Mixture's log probability log π β at different scenarios. (L) log π β has multiple modes. LogSumExp's LB preserves multimodality. (M) log π β reduces to Single Gaussian. Jensen's inequality becomes equality. (R) log π β is similar to a uniform distribution.

Algorithm 1 Offline RL with closed-form policy improvement operators (CFPI)
Input: Dataset D, baseline policy πb, value function Q-1, HP τ
1: Warm start Q0 = SARSA(Q-1, D) with the SARSA-style algorithm (Sutton & Barto, 2018)
2: Get one-step policy π1 = I(πb, Q0; τ)
3: for t = 1 ... T do
4:   Policy evaluation: Qt = E(Qt-1, πt, D)
5:   Policy improvement: πt+1 = I(πt, Qt; τ)

where D(s, a) denotes the number of samples at (s, a), C γ,δ denotes the learning coefficient of SARSA, and C CFPI(s, a) denotes the first-order approximation error from (3). We defer the detailed derivation and the expressions of C γ,δ, ζ^(t), and C CFPI(s, a) to Appendix A.4. When a = a β, C CFPI(s, a) = 0. By Theorem 3.1, π1 is a ζ-safe improved policy. The ζ-safeness consists of two parts: C CFPI is caused by the first-order approximation, and the $C_{\gamma,\delta}/\sqrt{D(s,a)}$ term is incurred by the SARSA update. Similarly, πT is a $\sum_{t=1}^T \zeta^{(t)}$-safe improved policy.

Figure 2: Aggregate metrics (Agarwal et al., 2021) with 95% CIs based on results reported in Table4. The CIs are estimated using the percentile bootstrap with stratified sampling. Higher median, IQM, and mean scores, and lower Optimality Gap correspond to better performance. Our I MG outperforms baselines by a significant margin based on all four metrics. Appendix E includes additional details.

Figure 3: Iterative I MG training results on AntMaze. Shaded area denotes one standard deviation.

Figure 7: Performance of I MG with varying ensemble sizes. Each variant averages returns over 8 seeds, and each seed contains 100 evaluation episodes. Each Q-value network is modeled by a 3-layer MLP. The shaded area denotes the bootstrapped 95% CI.

Figure 8: Performance of I MG with varying ensemble sizes on Walker2d-Medium-Replay-v2. Each variant aggregates returns over 8 seeds, and each seed evaluates for 100 episodes. Each Q-value network is modeled by a 3-layer MLP. With lower ensemble size, the performance exhibits large variance across different episodes.


Comparison between our Iterative I MG and SOTA methods on the AntMaze domain. We report the mean and standard deviation across 5 seeds for our method with each seed evaluating for 100 episodes. The performance for all baselines is directly reported from the IQL paper. Our Iterative I MG outperforms all baselines on 5 out of 6 tasks and obtains the best overall performance.

Table 2 compares our Iterative I MG with SOTA algorithms on the AntMaze domain. Our method uses the same set of HP for all 6 tasks, outperforming all baselines on 5 out of 6 tasks and obtaining the best overall performance. Appendix C.1 presents additional details with pseudo-code and training curves.

Our I SG(π IQL, Q IQL) improves over the policy π IQL learned by IQL on AntMaze. We report the mean and standard deviation across 10 seeds. Each seed evaluates for 100 episodes.



Ablation studies of our Method on the Gym-MuJoCo domain. Again we report the mean and std of 10 seeds, each seed evaluates for 100 episodes.

Algorithm 3 Policy improvement of I DET with a stochastic π β
Input: State s, stochastic policy π β, value function Q, δ, number of candidate actions to sample M
1: Sample candidate actions {a 1, ..., a M} from π β
2: Obtain the EBCQ policy π EBCQ with the action selected by π EBCQ(s) = arg max_{m=1...M} Q(s, a m)
3: Return I DET(π EBCQ, Q; δ) by calculating (57)

I DET results on the Gym-MuJoCo domain. We report the mean and standard deviation across 5 seeds, and each seed evaluates for 100 episodes.
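Lines 1-2 of Algorithm 3 amount to sampling M candidate actions from the stochastic behavior policy and keeping the highest-valued one. A minimal NumPy sketch (our own illustration; the sampler and Q-function are passed in as callables):

```python
import numpy as np

def ebcq_action(state, sample_actions, q_fn, num_candidates=50):
    """EBCQ action selection (Algorithm 3, Lines 1-2), a sketch.

    sample_actions(state, M) -> (M, d) candidate actions drawn from pi_beta
    q_fn(state, actions)     -> (M,) Q-values for the candidates
    """
    candidates = sample_actions(state, num_candidates)
    q_values = q_fn(state, candidates)
    return candidates[np.argmax(q_values)]
```

The selected action then serves as the deterministic input to I DET in Line 3, which applies the closed-form update of (57).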

Hyperparameters for our methods in Table 1 and Table 4.

HP search for MG-EBCQ. We report the mean and std across 10 seeds, and each seed evaluates for 100 episodes.

HP search for MG-Rev. KL Reg. We report the mean and std across 10 seeds, and each seed evaluates for 100 episodes.

HP search for SG-Rev. KL Reg. We report the mean and std across 10 seeds, and each seed evaluates for 100 episodes.



Results of MG-MS on the MuJoCo Gym domain. We report the mean and standard deviation across 10 seeds, and each seed evaluates for 100 episodes.

Comparison between setting the number of Gaussian components to 4 and 8 for our I MG on the three Medium-Replay datasets. We report the mean and standard deviation across 10 seeds, and each seed evaluates for 100 episodes.

Improving the policy learned by CQL with our CFPI operator I_SG.
Dataset | π_CQL (1M) | I_SG(π_CQL, Q_CQL)

OUTLINE OF THE APPENDIX

In this Appendix, we organize the content as follows:
• Appendix A presents the missing proofs for Propositions 3.1, 3.2, 3.3 and Theorem 3.1 in the main paper.
• Appendix B justifies one HP setting for Equation 15.
• Appendix C discusses how to instantiate multi-step and iterative algorithms from our algorithm template, Algorithm 1.
• Appendix D provides the derivation of a new CFPI operator that can work with both deterministic and VAE policies.
• Appendix E conducts a reliable evaluation to demonstrate the statistical significance of our methods and address statistical uncertainty.
• Appendix F gives experiment details, HP settings, and corresponding experiment results.
• Appendix G provides additional ablation studies and experiment results.
• Appendix H includes additional related work and discusses the relationship between our method and prior literature that leverages the Taylor expansion approach.

Our experiments are conducted on various types of 8-GPU machines. Different machines may have different GPU types, such as NVIDIA GA100 and TU102.

D CFPI BEYOND GAUSSIAN POLICIES

In the main paper, we mainly discuss the scenario where the behavior policy π_β is from the Gaussian family and develop two CFPI operators. However, our methods can also work with a non-Gaussian π_β. Below, we derive a new CFPI operator I_DET that works with a deterministic π_β. We then show that I_DET can also be leveraged to improve a general stochastic policy π_β without knowing its analytical expression, as long as we can sample from it.

D.1 DETERMINISTIC BEHAVIOR POLICY

When modeling both π = µ and π_β = µ_β as deterministic policies, we can derive the following BCPO from problem (4) by setting D(·, ·) as the mean squared error. Problem (56) has a similar form to problem (18). We can thus obtain its closed-form solution µ = µ_det(δ). Therefore, we can derive a new CFPI operator I_DET(π_β, Q; δ) that returns a policy with action selected by (57). We further note that problem (56) can be seen as a linear approximation of the objective used in TD3+BC (Fujimoto & Gu, 2021).
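To make the closed-form solution concrete, here is a hedged derivation sketch. It assumes problem (56) maximizes the linearized value subject to a squared-error trust region of radius δ; the exact constants in the paper's (57) may differ:

```latex
% Assumed form of problem (56): linearized objective, MSE trust region.
\max_{a}\;
  \big\langle \nabla_a Q(s,a)\big|_{a=\mu_\beta(s)},\; a - \mu_\beta(s) \big\rangle
\quad \text{s.t.}\quad
  \lVert a - \mu_\beta(s) \rVert_2^2 \;\le\; \delta .
% The objective is linear in a and the feasible set is a ball, so the
% maximizer lies on the boundary, in the direction of the gradient:
\mu_{\mathrm{det}}(\delta)(s)
  \;=\; \mu_\beta(s) \;+\; \sqrt{\delta}\;
  \frac{\nabla_a Q(s,a)\big|_{a=\mu_\beta(s)}}
       {\big\lVert \nabla_a Q(s,a)\big|_{a=\mu_\beta(s)} \big\rVert_2 } .
```

This also makes the TD3+BC connection transparent: TD3+BC penalizes the squared deviation from the behavior action, while here the same penalty appears as a hard constraint on the linearized objective.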

D.2 BEYOND DETERMINISTIC BEHAVIOR POLICY

Though we assume π_β to be a deterministic policy during the derivation of I_DET, we can in fact leverage I_DET to tackle the more general case where we can only sample from π_β without knowing its actual expression.

MG-BC. Although a Gaussian Mixture can be learned via expectation maximization (Jordan & Jacobs, 1994; Xu et al., 1994; Jin et al., 2016) or variational Bayes (Bishop & Svensén, 2012), we empirically find that directly minimizing the negative log-likelihood of actions sampled from the offline datasets achieves satisfactory performance, as shown in Table 1. We train the policy for 500K gradient steps. We emphasize that we do not aim to propose a better algorithm for learning a Gaussian Mixture behavior policy; future work may use a more advanced algorithm to better capture the underlying behavior policy.

SG-BC. We parameterize the policy as a 3-layer MLP, which outputs the tanh of a Single Gaussian with a state-dependent diagonal covariance matrix (Fu et al., 2020; Haarnoja et al., 2018). We train the policy for 500K gradient steps.

SARSA. We parameterize the value function with the IQN (Dabney et al., 2018a) architecture and train it to model the distribution Z^β : S × A → 𝒵 of the behavior return via quantile regression, where 𝒵 is the action-value distributional space (Ma et al., 2020). We define the CDF of Z^β as F_{Z^β}(z) = Pr(Z^β < z), leading to the quantile function F^{-1}_{Z^β} (Müller, 1997) as the inverse CDF, where ρ denotes the quantile fraction. We further denote Z^β_ρ = F^{-1}_{Z^β}(ρ) to ease the notation. To obtain Z^β, we leverage the empirical distributional Bellman operator T^β_D : 𝒵 → 𝒵 defined as T^β_D Z(s, a) :D= r(s, a) + γ Z(s', a'), where A :D= B implies that the random variables A and B are governed by the same distribution.
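The Huber quantile regression loss referenced in the SARSA paragraph can be sketched as follows. This is the standard QR-DQN/IQN-style loss (Dabney et al., 2018a) written in NumPy, not the paper's exact implementation; κ denotes the Huber threshold:

```python
import numpy as np

def quantile_huber_loss(u, rho, kappa=1.0):
    """Huber quantile loss on the TD residual u = z_target - z_pred.
    Penalizes over- and under-estimation asymmetrically according to the
    quantile fraction rho in [0, 1]."""
    abs_u = np.abs(u)
    # Huber branch: quadratic near zero, linear in the tails.
    huber = np.where(abs_u <= kappa,
                     0.5 * u ** 2,
                     kappa * (abs_u - 0.5 * kappa))
    # Asymmetric quantile weight |rho - 1{u < 0}|.
    weight = np.abs(rho - (u < 0).astype(float))
    return weight * huber / kappa

# The loss vanishes at a perfect fit:
assert quantile_huber_loss(np.array([0.0]), rho=0.9)[0] == 0.0
# For a high quantile (rho = 0.9), positive residuals cost more than
# negative residuals of the same magnitude:
pos = quantile_huber_loss(np.array([0.5]), rho=0.9)[0]
neg = quantile_huber_loss(np.array([-0.5]), rho=0.9)[0]
assert pos > neg
```

Averaging this loss over the sampled quantile fractions ρ̄_i and over target samples yields a J_Z(θ)-style objective of the kind described above.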
We note that T^β_D helps construct a Huber quantile regression loss (Dabney et al., 2018a; Ma et al., 2020; Dabney et al., 2018b), and we can finally learn Z^β by minimizing the quantile regression loss, following a similar procedure as in (Ma et al., 2020). To achieve this goal, we approximate Z^β by N_q quantile fractions {ρ_i ∈ [0, 1] | i = 0 . . . N_q} with ρ_0 = 0, ρ_{N_q} = 1 and ρ_i < ρ_j, ∀i < j. We further denote ρ̄_i = (ρ_i + ρ_{i+1})/2, and use random sampling (Dabney et al., 2018a) to generate the quantile fractions. By further parameterizing Z^β_ρ(s, a) as Ẑ^β_ρ(s, a; θ) with parameter θ, we can derive the loss function J_Z(θ), in which θ̄ is the parameter of the target network (Lillicrap et al., 2015) given by the Polyak averaging of θ. We refer interested readers to (Dabney et al., 2018a; Ma et al., 2020) for further details. The training procedure above returns Ẑ^β_ρ, ∀ρ ∈ [0, 1]. With the learned Ẑ^β_ρ, our one-step methods presented in Table 1 and Table 4 extract improved policies.

Since our methods still need to query out-of-buffer action values during rollout, we employ the conventional double Q-learning (Fujimoto et al., 2018) technique to prevent potential overestimation. Specifically, we initialize Q̂1_0 and Q̂2_0 differently and train them to minimize (60). With the learned Q̂1_0 and Q̂2_0, we set the value of Q̂_0(s, a) to the minimum of the two estimates.

Our I_MG (Table 1). Recall that our CFPI operator I_MG(π_β, Q̂_0; τ) requires learning a Gaussian Mixture behavior policy π̂_β and a value function Q̂_0. We train π̂_β and Q̂_0 according to the procedures listed in MG-BC and SARSA, respectively. Following the practice of (Brandfonbrener et al., 2021; Fu et al., 2020), we perform a grid search over log τ ∈ {0, 0.5, 1.0, 1.5, 2.0} using 3 seeds. We note that we manually reduce I_MG to MG-MS when log τ = 0 by only considering the mean of each non-trivial Gaussian component.
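Taking the minimum of two Q estimates has a convenient statistical reading: for a pair of values, min(Q1, Q2) equals their mean minus their population standard deviation, which is what motivates the mean-minus-std reformulation attributed to Ciosek et al. (2019). A minimal numerical check of this identity:

```python
import numpy as np

def double_q(q1, q2):
    # Clipped double-Q estimate: elementwise minimum of the two critics.
    return np.minimum(q1, q2)

def mean_minus_std(q1, q2):
    # Equivalent reformulation: mean minus population std over the pair.
    qs = np.stack([q1, q2])
    return qs.mean(axis=0) - qs.std(axis=0)

q1 = np.array([3.0, -1.0, 7.5])
q2 = np.array([5.0, -1.0, 2.5])
# The two formulations agree exactly for any pair of critics.
assert np.allclose(double_q(q1, q2), mean_minus_std(q1, q2))
```

The identity holds because, for two numbers, the population standard deviation is exactly half their absolute difference.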
Our results show that setting log τ = 0.5 achieves the best overall performance, while Hopper-M-E requires an extremely small log τ to perform well, as shown in Appendix G.2. Therefore, we set log τ = 0 for Hopper-M-E and log τ = 0.5 for the other 8 datasets. We then obtain the results for the remaining 7 seeds with these HP settings and report results on 10 seeds in total.

I_MG (Table 4) & I_SG (Table 4). Different from the results in Table 1, we use the same log τ = 0.5 for all datasets, including Hopper-M-E, to obtain the performance of I_MG in Table 4. In this way, we aim to better understand the effectiveness of each component of our methods. To fairly compare I_MG and I_SG, we tune τ for I_SG analogously by performing a grid search over log τ ∈ {0.5, 1.0, 1.5, 2.0} with 3 seeds, finally setting log τ = 0.5 for all datasets. We then obtain the results for the remaining 7 seeds and report results with 10 seeds in total.

MG-EBCQ & SG-EBCQ. We tune the number of candidate actions N_bcq over the same range {2, 5, 10, 20, 50, 100} as in (Brandfonbrener et al., 2021). For each N_bcq, we obtain its average performance over all tasks across 10 seeds and select the best-performing N_bcq for each method. We tune N_bcq separately for MG-EBCQ and SG-EBCQ, resulting in N_bcq = 5 for MG-EBCQ and N_bcq = 10 for SG-EBCQ. Moreover, we highlight that MG-EBCQ (SG-EBCQ) uses the same behavior policy and value function as I_MG (I_SG). We include the full hyperparameter search results in Table 9 and Table 10.

Ensemble of Q-value networks. We train an ensemble of Q-value networks Q̂_MLP(s, a; θ_k), where θ̄_k is the parameter of the corresponding target network given by the Polyak averaging of θ_k. We further note that Equation 62 can be reformulated as Q̂_0(s, a) = μ_Q(s, a) − σ_Q(s, a), where μ_Q and σ_Q calculate the mean and standard deviation of the Q values (Ciosek et al., 2019).
In the case of an ensemble of Q, we obtain Q̂_0(s, a) by generalizing (64) to Q̂_0(s, a) = μ̂_Q(s, a) − σ̂_Q(s, a) (65), where μ̂_Q and σ̂_Q denote the mean and standard deviation over the ensemble members Q̂_MLP(s, a; θ_k). Other than the Q-value network, we apply the same settings as I_MG in Table 4. Fig. 7 presents the results with different ensemble sizes, showing that the performance generally increases with the ensemble size. Such a phenomenon illustrates a limitation of our CFPI operator I_MG, as it heavily relies on accurate gradient information ∇_a [Q̂_0(s, a)]_{a=a_β}. A large ensemble of Q is more likely to provide accurate gradient information, thus leading to better performance. In contrast, a small ensemble provides noisy gradient information, resulting in high variance across different rollouts, as shown in Fig. 8.

We determine the number of gradient steps for SARSA training on each dataset according to the results in Fig. 9, obtained with one seed, as listed in Table 16. We note that the IQL paper does not report consistent results for the offline experiment performance on AntMaze, as shown in Table 17. We suspect that these results were obtained from different sets of random seeds. Therefore, we conclude that our reproduced results match the results reported in the IQL paper. We believe our reproduction of IQL is reasonable, even though we do not use the official implementation open-sourced by the authors.
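Generalizing the pairwise minimum to an ensemble of K critics, the lower-confidence-bound value in the spirit of (65) is the ensemble mean minus its standard deviation. A small sketch, assuming a plain mean-minus-std combination (the paper may scale the std term differently):

```python
import numpy as np

def ensemble_lcb(q_values):
    """Lower-confidence-bound Q from a (K, batch) array of ensemble
    predictions: mean over the K heads minus their standard deviation."""
    q_values = np.asarray(q_values)
    return q_values.mean(axis=0) - q_values.std(axis=0)

# Four critics evaluated on a batch of two state-action pairs.
qs = np.array([[1.0, 4.0],
               [2.0, 4.0],
               [3.0, 4.0],
               [2.0, 4.0]])
lcb = ensemble_lcb(qs)
# Disagreement among heads lowers the estimate; unanimity leaves it intact.
assert lcb[0] < qs[:, 0].mean()
assert lcb[1] == 4.0
```

Because the penalty grows with disagreement among heads, a larger ensemble yields a smoother Q̂_0 surface and, correspondingly, less noisy action gradients, matching the trend in Fig. 7.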

