BEHAVIOR PROXIMAL POLICY OPTIMIZATION

Abstract

Offline reinforcement learning (RL) is a challenging setting where existing off-policy actor-critic methods perform poorly due to the overestimation of out-of-distribution state-action pairs. Thus, various additional augmentations are proposed to keep the learned policy close to the offline dataset (or the behavior policy). In this work, starting from the analysis of offline monotonic policy improvement, we reach a surprising conclusion that online on-policy algorithms are naturally able to solve offline RL. Specifically, the inherent conservatism of these on-policy algorithms is exactly what offline RL needs to overcome overestimation. Based on this, we propose Behavior Proximal Policy Optimization (BPPO), which solves offline RL without introducing any extra constraint or regularization beyond PPO. Extensive experiments on the D4RL benchmark empirically show that this extremely succinct method outperforms state-of-the-art offline RL algorithms.

1. INTRODUCTION

Typically, reinforcement learning (RL) is thought of as a paradigm for online learning, where the agent interacts with the environment, collects experiences, and then uses them to improve itself (Sutton et al., 1998). This online process poses the biggest obstacle to real-world RL applications, because data collection is expensive or even risky in fields such as navigation (Mirowski et al., 2018) and healthcare (Yu et al., 2021a). As an alternative, offline RL eliminates online interaction and learns from a fixed dataset collected by some arbitrary and possibly unknown process (Lange et al., 2012; Fu et al., 2020). The prospect of this data-driven mode (Levine et al., 2020) is encouraging, and it carries great expectations for solving real-world RL applications.

Unfortunately, the major advantage of offline RL, the absence of online interaction, also raises a new challenge. Although offline RL can be viewed as an extreme off-policy case, classical off-policy iterative algorithms tend to underperform because they overestimate out-of-distribution (OOD) state-action pairs. More specifically, when the Q-function poorly estimates the value of OOD state-action pairs during policy evaluation, the agent tends to take OOD actions with erroneously high estimated values, resulting in low performance after policy improvement (Fujimoto et al., 2019). Thus, to overcome the overestimation issue, some solutions keep the learned policy close to the behavior policy (or the offline dataset) (Fujimoto et al., 2019; Wu et al., 2019; Fujimoto & Gu, 2021).

Most offline RL algorithms adopt online interactions to select hyperparameters, because offline hyperparameter selection, which selects hyperparameters without any online interaction, remains an open problem lacking satisfactory solutions (Paine et al., 2020; Zhang & Jiang, 2021). Deploying a policy learned by offline RL is potentially risky in certain areas (Mirowski et al., 2018; Yu et al., 2021a) since its performance is unknown. However, the risk during online interaction is greatly reduced if the deployed policy is guaranteed to perform better than the behavior policy. This inspires us to consider how to use an offline dataset to improve the behavior policy with a monotonic performance guarantee. We formulate this problem as offline monotonic policy improvement.

To analyze offline monotonic policy improvement, we introduce the Performance Difference Theorem (Kakade & Langford, 2002). During the analysis, we find that the offline setting does make monotonic policy improvement more complicated, but the way to monotonically improve the policy remains unchanged. This indicates that algorithms derived from online monotonic policy improvement, such as Proximal Policy Optimization (PPO), can also achieve offline monotonic policy improvement; in other words, PPO can naturally solve offline RL. Based on this surprising discovery, we propose Behavior Proximal Policy Optimization (BPPO), an offline algorithm that monotonically improves the behavior policy in the manner of PPO. Owing to the inherent conservatism of PPO, BPPO restricts the ratio of the learned policy to the behavior policy within a certain range, similar to the offline RL methods that keep the learned policy close to the behavior policy.
As offline algorithms become more and more sophisticated, TD3+BC (Fujimoto & Gu, 2021), which augments TD3 (Fujimoto et al., 2018) with behavior cloning (Pomerleau, 1988), reminds us to revisit simple alternatives with potentially good performance. BPPO is such a "most simple" alternative: it introduces no extra constraint or regularization on top of PPO. Extensive experiments on the D4RL benchmark (Fu et al., 2020) empirically show that BPPO outperforms state-of-the-art offline RL algorithms.

2. PRELIMINARIES

2.1. REINFORCEMENT LEARNING

Reinforcement Learning (RL) is a framework for sequential decision making. Typically, the problem is formulated as a Markov decision process (MDP) $\mathcal{M} = \{\mathcal{S}, \mathcal{A}, r, p, d_0, \gamma\}$, with state space $\mathcal{S}$, action space $\mathcal{A}$, scalar reward function $r$, transition dynamics $p$, initial state distribution $d_0(s_0)$ and discount factor $\gamma$ (Sutton et al., 1998). The objective of RL is to learn a policy $\pi(a_t|s_t)$, which defines a distribution over actions conditioned on the state at timestep $t$, where $a_t \in \mathcal{A}$, $s_t \in \mathcal{S}$. Given this definition, the trajectory $\tau = (s_0, a_0, \cdots, s_T, a_T)$ generated by the agent's interaction with the environment $\mathcal{M}$ follows the distribution
$$P_\pi(\tau) = d_0(s_0) \prod_{t=0}^{T} \pi(a_t|s_t)\, p(s_{t+1}|s_t, a_t),$$
where $T$ is the length of the trajectory, which can be infinite. The goal of RL can then be written as an expectation under the trajectory distribution, $J(\pi) = \mathbb{E}_{\tau \sim P_\pi(\tau)}\left[\sum_{t=0}^{T} \gamma^t r(s_t, a_t)\right]$. This objective can also be measured by a state-action value function $Q_\pi(s, a)$, the expected discounted return given the action $a$ in state $s$:
$$Q_\pi(s, a) = \mathbb{E}_{\tau \sim P_\pi(\tau|s,a)}\left[\sum_{t=0}^{T} \gamma^t r(s_t, a_t)\,\Big|\, s_0 = s, a_0 = a\right].$$
Similarly, the value function $V_\pi(s)$ is the expected discounted return from a certain state $s$: $V_\pi(s) = \mathbb{E}_{\tau \sim P_\pi(\tau|s)}\left[\sum_{t=0}^{T} \gamma^t r(s_t, a_t)\,|\, s_0 = s\right]$. Then we can define the advantage function $A_\pi(s, a) = Q_\pi(s, a) - V_\pi(s)$.
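To make the objective concrete, here is a minimal Python sketch (ours, not from the paper) of the discounted return whose expectation over trajectories defines $J(\pi)$:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum_t gamma^t * r_t for one trajectory via backwards recursion."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# One trajectory with hypothetical rewards; J(pi) is the expectation of this
# quantity over trajectories tau ~ P_pi(tau).
print(discounted_return([1.0, 0.5, 0.0, 2.0]))
```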

2.2. OFFLINE REINFORCEMENT LEARNING

In offline RL, the agent only has access to a fixed dataset of transitions $D = \{(s_t, a_t, s_{t+1}, r_t)\}_{t=1}^{N}$ collected by a behavior policy $\pi_\beta$. Without interacting with the environment $\mathcal{M}$, offline RL expects the agent to infer a policy from the dataset. Behavior cloning (BC) (Pomerleau, 1988), an imitation learning approach, directly imitates the action at each state with supervised learning:
$$\hat{\pi}_\beta = \arg\max_\pi \mathbb{E}_{(s,a)\sim D}\left[\log \pi(a|s)\right]. \quad (1)$$
Note that the performance of $\hat{\pi}_\beta$ trained by behavior cloning highly depends on the quality of the transitions, and hence on the collection process of the behavior policy $\pi_\beta$. In the rest of this paper, improving the behavior policy actually refers to improving the estimated behavior policy $\hat{\pi}_\beta$, because $\pi_\beta$ is unknown.
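As an illustration, objective (1) reduces to maximum-likelihood regression on dataset pairs. Below is a minimal PyTorch sketch; the diagonal-Gaussian parameterization and layer sizes are our assumptions rather than details specified here:

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Diagonal-Gaussian policy pi(a|s); architecture choices are illustrative."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mu = nn.Linear(hidden, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def dist(self, states):
        h = self.trunk(states)
        return torch.distributions.Normal(self.mu(h), self.log_std.exp())

def bc_loss(policy, states, actions):
    """Eq. (1): maximizing E_D[log pi(a|s)] = minimizing the negative log-likelihood."""
    return -policy.dist(states).log_prob(actions).sum(-1).mean()
```

Minimizing `bc_loss` over minibatches of $(s, a)$ drawn from $D$ yields the estimated behavior policy $\hat{\pi}_\beta$.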

2.3. PERFORMANCE DIFFERENCE THEOREM

Theorem 1. (Kakade & Langford, 2002) Define the discounted unnormalized visitation frequencies as $\rho_\pi(s) = \sum_{t=0}^{T} \gamma^t P(s_t = s|\pi)$, where $P(s_t = s|\pi)$ is the probability that the $t$-th state equals $s$ in trajectories generated by policy $\pi$. For any two policies $\pi$ and $\pi'$, the performance difference $J_\Delta(\pi', \pi) \triangleq J(\pi') - J(\pi)$ can be measured by the advantage function:
$$J_\Delta(\pi', \pi) = \mathbb{E}_{\tau \sim P_{\pi'}(\tau)}\left[\sum_{t=0}^{T} \gamma^t A_\pi(s_t, a_t)\right] = \mathbb{E}_{s \sim \rho_{\pi'}(\cdot),\, a \sim \pi'(\cdot|s)}\left[A_\pi(s, a)\right]. \quad (2)$$

The derivation details are presented in Appendix A. This theorem implies that improving the policy from $\pi$ to $\pi'$ can be achieved by maximizing (2). Trust Region Policy Optimization (TRPO) (Schulman et al., 2015a), which guarantees monotonic improvement of performance, is derived from this theorem. We also apply this theorem to formulate offline monotonic policy improvement.

3. OFFLINE MONOTONIC IMPROVEMENT OVER BEHAVIOR POLICY

In this section, we theoretically analyze offline monotonic policy improvement based on Theorem 1, namely improving the policy $\hat{\pi}_\beta$ obtained by behavior cloning (1) using the offline dataset $D$. Applying the Performance Difference Theorem to the estimated behavior policy $\hat{\pi}_\beta$, we get
$$J_\Delta(\pi, \hat{\pi}_\beta) = \mathbb{E}_{s \sim \rho_\pi(\cdot),\, a \sim \pi(\cdot|s)}\left[A_{\hat{\pi}_\beta}(s, a)\right]. \quad (3)$$
Maximizing this equation yields a policy better than the behavior policy $\hat{\pi}_\beta$. But the equation is not tractable due to its dependence on the new policy's state distribution $\rho_\pi(s)$. In standard online methods, $\rho_\pi(s)$ is replaced by the old state distribution $\rho_{\hat{\pi}_\beta}(s)$. In the offline setting, however, $\rho_{\hat{\pi}_\beta}(s)$ cannot be obtained through interaction with the environment as in the online situation. Instead, we use the state distribution recovered from the offline dataset, $\rho_D(s) = \sum_{t=0}^{T} \gamma^t P(s_t = s|D)$, where $P(s_t = s|D)$ is the probability that the $t$-th state equals $s$ in the offline dataset. The approximation of $J_\Delta(\pi, \hat{\pi}_\beta)$ can therefore be written as
$$\hat{J}_\Delta(\pi, \hat{\pi}_\beta) = \mathbb{E}_{s \sim \rho_D(\cdot),\, a \sim \pi(\cdot|s)}\left[A_{\hat{\pi}_\beta}(s, a)\right]. \quad (4)$$
To measure the difference between $J_\Delta(\pi, \hat{\pi}_\beta)$ and its approximation $\hat{J}_\Delta(\pi, \hat{\pi}_\beta)$, we introduce an intermediate term $\mathbb{E}_{s \sim \rho_{\hat{\pi}_\beta}(\cdot),\, a \sim \pi(\cdot|s)}[A_{\hat{\pi}_\beta}(s, a)]$ with the state distribution $\rho_{\hat{\pi}_\beta}(s)$. The proof relies on the commonly used total variational divergence between $\pi$ and $\hat{\pi}_\beta$ at state $s$, $D_{TV}(\pi \| \hat{\pi}_\beta)[s] = \frac{1}{2}\mathbb{E}_a\left[|\pi(a|s) - \hat{\pi}_\beta(a|s)|\right]$. The total variational divergence between the offline dataset $D$ and the estimated behavior policy $\hat{\pi}_\beta$ is less straightforward: we view the offline dataset $D = \{(s_t, a_t, s_{t+1}, r_t)\}_{t=1}^{N}$ as a deterministic distribution, which gives the following distance:

Proposition 1. For offline dataset $D = \{(s_t, a_t, s_{t+1}, r_t)\}_{t=1}^{N}$ and policy $\hat{\pi}_\beta$, the total variational divergence can be expressed as $D_{TV}(D \| \hat{\pi}_\beta)[s_t] = \frac{1}{2}\left(1 - \hat{\pi}_\beta(a_t|s_t)\right)$.

The detailed derivation is presented in Appendix B. Now we are ready to measure the difference:

Theorem 2. Given the distances $D_{TV}(\pi \| \hat{\pi}_\beta)[s]$ and $D_{TV}(D \| \hat{\pi}_\beta)[s_t] = \frac{1}{2}(1 - \hat{\pi}_\beta(a_t|s_t))$, we can derive the following bound:
$$J_\Delta(\pi, \hat{\pi}_\beta) \geq \hat{J}_\Delta(\pi, \hat{\pi}_\beta) - 4\gamma A_{\hat{\pi}_\beta} \max_s D_{TV}(\pi \| \hat{\pi}_\beta)[s] \cdot \mathbb{E}_{s \sim \rho_{\hat{\pi}_\beta}(\cdot)}\left[D_{TV}(\pi \| \hat{\pi}_\beta)[s]\right] - 2\gamma A_{\hat{\pi}_\beta} \max_s D_{TV}(\pi \| \hat{\pi}_\beta)[s] \cdot \mathbb{E}_{s \sim \rho_D(\cdot)}\left[1 - \hat{\pi}_\beta(a|s)\right], \quad (5)$$
where $A_{\hat{\pi}_\beta} = \max_{s,a}|A_{\hat{\pi}_\beta}(s, a)|$.

The proof is presented in Appendix C. Compared to the corresponding theorems in the online setting (Schulman et al., 2015a; Achiam et al., 2017; Queeney et al., 2021), the second term on the right of Equation (5) is similar, while the third term is unique to the offline setting: $\mathbb{E}_{s \sim \rho_D(\cdot)}[1 - \hat{\pi}_\beta(a|s)]$ represents the difference caused by the mismatch between the offline dataset $D$ and $\hat{\pi}_\beta$. Once $\hat{\pi}_\beta$ is determined, this term is a constant. And since the inequality $\max_s D_{TV}(\pi \| \hat{\pi}_\beta)[s] \geq \mathbb{E}_{s \sim \rho_{\hat{\pi}_\beta}(\cdot)}[D_{TV}(\pi \| \hat{\pi}_\beta)[s]]$ holds, we can claim the following conclusion:

Conclusion 1

To guarantee that the true objective $J_\Delta(\pi, \hat{\pi}_\beta)$ is non-decreasing, we should simultaneously maximize $\mathbb{E}_{s \sim \rho_D(\cdot),\, a \sim \pi(\cdot|s)}[A_{\hat{\pi}_\beta}(s, a)]$ and minimize $\max_s D_{TV}(\pi \| \hat{\pi}_\beta)[s]$, which means the offline dataset $D$ is capable of monotonically improving the estimated behavior policy $\hat{\pi}_\beta$.

Suppose we have improved the behavior policy $\hat{\pi}_\beta$ and obtained a policy $\pi_k$. The above theorem only guarantees that $\pi_k$ performs better than $\hat{\pi}_\beta$; $\pi_k$ may not be optimal. If the offline dataset $D$ can still improve the policy $\pi_k$ to obtain a better policy $\pi_{k+1}$, then $\pi_{k+1}$ must be closer to the optimal policy. Thus, we further analyze monotonic policy improvement over the policy $\pi_k$. Applying the Performance Difference Theorem 1 to the policy $\pi_k$,
$$J_\Delta(\pi, \pi_k) = \mathbb{E}_{s \sim \rho_\pi(\cdot),\, a \sim \pi(\cdot|s)}\left[A_{\pi_k}(s, a)\right]. \quad (6)$$
To approximate this equation, the common approach replaces $\rho_\pi$ with the old policy's state distribution $\rho_{\pi_k}$. But in offline RL, $\pi_k$ is forbidden from acting in the environment, so the state distribution $\rho_{\pi_k}$ is impossible to estimate. Thus, the only choice is to replace $\rho_{\pi_k}$ with the state distribution of the offline dataset $D$:
$$\hat{J}_\Delta(\pi, \pi_k) = \mathbb{E}_{s \sim \rho_D(\cdot),\, a \sim \pi(\cdot|s)}\left[A_{\pi_k}(s, a)\right]. \quad (7)$$
Intuitively, this replacement is reasonable if $\pi_k$ and $\hat{\pi}_\beta$ are similar, which means the approximation must be related to the distance $D_{TV}(\pi_k \| \hat{\pi}_\beta)[s]$. Concretely, the gap can be formulated as follows:

Theorem 3. Given the distances $D_{TV}(\pi \| \pi_k)[s]$, $D_{TV}(\pi_k \| \hat{\pi}_\beta)[s]$ and $D_{TV}(D \| \hat{\pi}_\beta)[s] = \frac{1}{2}(1 - \hat{\pi}_\beta(a|s))$, we can derive the following bound:
$$\begin{aligned} J_\Delta(\pi, \pi_k) \geq \hat{J}_\Delta(\pi, \pi_k) &- 4\gamma A_{\pi_k} \max_s D_{TV}(\pi \| \pi_k)[s] \cdot \mathbb{E}_{s \sim \rho_{\pi_k}(\cdot)}\left[D_{TV}(\pi \| \pi_k)[s]\right] \\ &- 4\gamma A_{\pi_k} \max_s D_{TV}(\pi \| \pi_k)[s] \cdot \mathbb{E}_{s \sim \rho_{\hat{\pi}_\beta}(\cdot)}\left[D_{TV}(\pi_k \| \hat{\pi}_\beta)[s]\right] \\ &- 2\gamma A_{\pi_k} \max_s D_{TV}(\pi \| \pi_k)[s] \cdot \mathbb{E}_{s \sim \rho_D(\cdot)}\left[1 - \hat{\pi}_\beta(a|s)\right], \end{aligned} \quad (8)$$
where $A_{\pi_k} = \max_{s,a}|A_{\pi_k}(s, a)|$.

The proof is presented in Appendix D. Compared to Theorem 2, one additional term related to the distance between $\pi_k$ and $\hat{\pi}_\beta$ has been introduced. The distance $\mathbb{E}_{s \sim \rho_{\hat{\pi}_\beta}(\cdot)}[D_{TV}(\pi_k \| \hat{\pi}_\beta)[s]]$ is irrelevant to the target policy $\pi$ and can also be viewed as a constant. Besides, Theorem 2 is a specific case of this theorem with $\pi_k = \hat{\pi}_\beta$. Thus, we set $\pi_0 = \hat{\pi}_\beta$ since $\hat{\pi}_\beta$ is the first policy to be improved, and in the following sections we will no longer deliberately distinguish $\hat{\pi}_\beta$ from $\pi_k$. Similarly, we can derive the following conclusion:

Conclusion 2

To guarantee that the true objective $J_\Delta(\pi, \pi_k)$ is non-decreasing, we should simultaneously maximize $\mathbb{E}_{s \sim \rho_D(\cdot),\, a \sim \pi(\cdot|s)}[A_{\pi_k}(s, a)]$ and minimize $\max_s D_{TV}(\pi \| \pi_k)[s]$, which means the offline dataset $D$ is capable of monotonically improving the policy $\pi_k$, where $k = 0, 1, 2, \cdots$.
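For intuition, the dataset-mismatch penalty $\mathbb{E}_{s\sim\rho_D(\cdot)}[1-\hat{\pi}_\beta(a|s)]$ from Theorem 2 can be estimated directly on dataset samples. The sketch below is ours (reusing the GaussianPolicy sketch above) and comes with a caveat: for continuous policies, $\hat{\pi}_\beta(a|s)$ is a density that can exceed 1, so clamping it to $[0, 1]$ is a heuristic proxy for the probability the theory assumes:

```python
import torch

@torch.no_grad()
def dataset_mismatch_penalty(pi_beta_hat, states, actions):
    """Monte Carlo estimate of E_{s~rho_D}[1 - pi_beta_hat(a|s)] on dataset samples.

    Caveat: the likelihood of a continuous policy is a density and may exceed 1,
    so we clamp it; this is a heuristic stand-in for a true probability.
    """
    lik = pi_beta_hat.dist(states).log_prob(actions).sum(-1).exp()
    return (1.0 - lik.clamp(max=1.0)).mean().item()
```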

4. BEHAVIOR PROXIMAL POLICY OPTIMIZATION

In this section, we derive a practical algorithm from the theoretical results. Surprisingly, the loss function of this algorithm is the same as that of the online on-policy method Proximal Policy Optimization (PPO) (Schulman et al., 2017). Furthermore, this algorithm highly depends on the behavior policy, so we name it Behavior Proximal Policy Optimization, shortened as BPPO. According to Conclusion 2, to monotonically improve the policy $\pi_k$, we should jointly optimize
$$\underset{\pi}{\text{Maximize}}\ \mathbb{E}_{s \sim \rho_D(\cdot),\, a \sim \pi(\cdot|s)}\left[A_{\pi_k}(s, a)\right] \quad \& \quad \underset{\pi}{\text{Minimize}}\ \max_s D_{TV}(\pi \| \pi_k)[s], \quad (9)$$
where $k = 0, 1, 2, \cdots$ and $\pi_0 = \hat{\pi}_\beta$. But minimizing the total variational divergence between $\pi$ and $\pi_k$ leads to the trivial solution $\pi = \pi_k$, which makes no improvement over $\pi_k$. A more reasonable optimization objective is to maximize $\hat{J}_\Delta(\pi, \pi_k)$ while constraining the divergence:
$$\underset{\pi}{\text{Maximize}}\ \mathbb{E}_{s \sim \rho_D(\cdot),\, a \sim \pi(\cdot|s)}\left[A_{\pi_k}(s, a)\right] \quad \text{s.t.}\ \max_s D_{TV}(\pi \| \pi_k)[s] \leq \epsilon. \quad (10)$$
For the term to be maximized, we adopt importance sampling so that the expectation depends only on the action distribution of the old policy $\pi_k$ rather than the new policy $\pi$:
$$\mathbb{E}_{s \sim \rho_D(\cdot),\, a \sim \pi(\cdot|s)}\left[A_{\pi_k}(s, a)\right] = \mathbb{E}_{s \sim \rho_D(\cdot),\, a \sim \pi_k(\cdot|s)}\left[\frac{\pi(a|s)}{\pi_k(a|s)} A_{\pi_k}(s, a)\right]. \quad (11)$$
In this way, we can estimate this term by sampling states from the offline dataset, $s \sim \rho_D(\cdot)$, and then sampling actions with the old policy, $a \sim \pi_k(\cdot|s)$. For the total variational divergence, we rewrite it as
$$\max_s D_{TV}(\pi \| \pi_k)[s] = \max_s \frac{1}{2}\int_a \left|\pi(a|s) - \pi_k(a|s)\right| da = \max_s \frac{1}{2}\int_a \pi_k(a|s)\left|\frac{\pi(a|s)}{\pi_k(a|s)} - 1\right| da = \frac{1}{2}\max_s \mathbb{E}_{a \sim \pi_k(\cdot|s)}\left[\left|\frac{\pi(a|s)}{\pi_k(a|s)} - 1\right|\right]. \quad (12)$$
In the offline setting, only states $s \sim \rho_D(\cdot)$ are available and other states are inaccessible, so the operation $\max_s$ can also be expressed as $\max_{s \sim \rho_D(\cdot)}$. Comparing Equations (11) and (12), we find that the state distribution, the action distribution and the policy ratio appear in both. Thus we consider how to insert the divergence constraint into Equation (11). The following constraints are equivalent:
$$\max_{s \sim \rho_D(\cdot)} D_{TV}(\pi \| \pi_k)[s] \leq \epsilon \iff \max_{s \sim \rho_D(\cdot)} \mathbb{E}_{a \sim \pi_k(\cdot|s)}\left[\left|\frac{\pi(a|s)}{\pi_k(a|s)} - 1\right|\right] \leq 2\epsilon \iff \max_{s \sim \rho_D(\cdot)} \mathbb{E}_{a \sim \pi_k(\cdot|s)}\left[\mathrm{clip}\left(\frac{\pi(a|s)}{\pi_k(a|s)},\, 1 - 2\epsilon,\, 1 + 2\epsilon\right)\right], \quad \mathrm{clip}(x, l, u) = \min(\max(x, l), u). \quad (13)$$
Here the max operation is impractical to solve, so we adopt a heuristic approximation (Schulman et al., 2015a) that replaces the max with an expectation. The divergence constraint (13) can then be inserted:
$$L_k(\pi) = \mathbb{E}_{s \sim \rho_D(\cdot),\, a \sim \pi_k(\cdot|s)}\left[\min\left(\frac{\pi(a|s)}{\pi_k(a|s)} A_{\pi_k}(s, a),\ \mathrm{clip}\left(\frac{\pi(a|s)}{\pi_k(a|s)},\, 1 - 2\epsilon,\, 1 + 2\epsilon\right) A_{\pi_k}(s, a)\right)\right], \quad (14)$$
where the min operation makes this objective a lower bound of Equation (11). This loss function is almost identical to PPO (Schulman et al., 2017); the only difference is the state distribution. Therefore, we claim that online on-policy algorithms are naturally able to solve offline RL.
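Equation (14) is straightforward to implement. The following PyTorch sketch is ours (it reuses the GaussianPolicy interface from the Section 2.2 sketch and assumes precomputed advantages); it makes the point of this section concrete: apart from where the states and actions come from, the code is exactly the PPO-clip loss:

```python
import torch

def bppo_loss(policy, old_policy, states, actions, advantages, eps):
    """Clipped surrogate of Eq. (14). Identical in form to the PPO-clip loss;
    the only difference is that `states` are drawn from the offline dataset
    and `actions` from the old policy pi_k rather than from online rollouts."""
    log_prob = policy.dist(states).log_prob(actions).sum(-1)
    with torch.no_grad():
        old_log_prob = old_policy.dist(states).log_prob(actions).sum(-1)
    ratio = (log_prob - old_log_prob).exp()          # pi(a|s) / pi_k(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - 2 * eps, 1.0 + 2 * eps) * advantages
    # min(...) keeps the objective a lower bound of Eq. (11); negate to minimize.
    return -torch.min(unclipped, clipped).mean()
```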

5. DISCUSSIONS AND IMPLEMENTATION DETAILS

In this section, we first highlight why BPPO can solve offline reinforcement learning, namely, how it overcomes the overestimation issue. Then we discuss some implementation details, especially the approximation of the advantage $A_{\pi_k}(s, a)$. Finally, we analyze the relation between BPPO and previous algorithms, including Onestep RL and iterative methods.

Why can BPPO solve offline RL? According to the final loss (14) and Equation (13), BPPO constrains the closeness through the expectation of the total variational divergence:
$$\mathbb{E}_{s \sim \rho_D(\cdot),\, a \sim \pi_k(\cdot|s)}\left[\left|\frac{\pi(a|s)}{\pi_k(a|s)} - 1\right|\right] \leq 2\epsilon. \quad (15)$$
If $k = 0$, this equation ensures the closeness between the learned policy $\pi$ and the behavior policy $\hat{\pi}_\beta$. When $k > 0$, one issue worthy of attention is whether the closeness between $\pi$ and $\pi_k$ can indirectly constrain the closeness between $\pi$ and $\hat{\pi}_\beta$. To achieve this, and to prevent the learned policy $\pi$ from drifting completely away from $\hat{\pi}_\beta$, we introduce a technique called clip ratio decay. As the policy updates, the clip ratio $\epsilon$ gradually decreases until a certain training step (such as step 200) is reached:
$$\epsilon_i = \epsilon_0 \cdot \sigma^i \ \text{ if } i \leq 200, \quad \epsilon_i = \epsilon_{200} \ \text{ otherwise}, \quad (16)$$
where $i$ denotes the training step, $\epsilon_0$ denotes the initial clip ratio, and $\sigma \in (0, 1]$ is the decay coefficient (a short sketch of this schedule appears at the end of this section). From Figures 1(a) and 1(b), we find that without the clip ratio decay technique (i.e., $\sigma = 1$), the ratio $\pi_k / \hat{\pi}_\beta$ may move outside the range $[1 - 2\epsilon, 1 + 2\epsilon]$ (the region surrounded by the dotted pink and purple lines). With decay, the ratio stays within the range stably, which means Equation (15) can ensure the closeness between the final policy learned by BPPO and the behavior policy.

How to approximate the advantage? When calculating the loss function (14), the only difference from the online situation is the approximation of the advantage $A_{\pi_k}(s, a)$. In online RL, GAE (Generalized Advantage Estimation) (Schulman et al., 2015b) approximates $A_{\pi_k}$ using the data collected by policy $\pi_k$. Obviously, GAE is inappropriate in the offline situation because it relies on online interaction (see Appendix E). As a result, BPPO has to calculate the advantage $A_{\pi_k} = Q_{\pi_k} - V_{\hat{\pi}_\beta}$ in an off-policy manner. Besides, we have another simple choice based on the fact that $\pi_k$ is close to $\hat{\pi}_\beta$ thanks to clip ratio decay: we can replace all the $A_{\pi_k}$ with $A_{\hat{\pi}_\beta}$. This may introduce some error, but the benefit is that $A_{\hat{\pi}_\beta}$ must be more accurate than $A_{\pi_k}$, since off-policy estimation is potentially dangerous, especially in the offline setting. We conduct a series of experiments in Section 7.2 to compare these two implementations and find that the latter, advantage replacement, is better. Based on the above implementation details, we summarize the whole workflow of BPPO in Algorithm 1.

Algorithm 1 Behavior Proximal Policy Optimization (BPPO)
 1: Estimate the behavior policy $\hat{\pi}_\beta$ by behavior cloning (1);
 2: Estimate $Q_{\hat{\pi}_\beta}$ by SARSA (Appendix E);
 3: Estimate $V_{\hat{\pi}_\beta}$ by fitting returns (Appendix E);
 4: Initialize $k = 0$ and set $\pi_k \leftarrow \hat{\pi}_\beta$ & $Q_{\pi_k} = Q_{\hat{\pi}_\beta}$;
 5: for $i = 0, 1, 2, \cdots, I$ do
 6:   $A_{\pi_k} = Q_{\pi_k} - V_{\hat{\pi}_\beta}$
 7:   Update the policy $\pi$ by maximizing $L_k(\pi)$;
 8:   if $J(\pi) > J(\pi_k)$ then
 9:     Set $k = k + 1$ & $\pi_k \leftarrow \pi$;
10:     if advantage replacement then
11:       $Q_{\pi_k} = Q_{\hat{\pi}_\beta}$;
12:     else
13:       Calculate $Q_{\pi_k}$ by Q-learning;
14:     end if
15:   end if
16: end for

What is the relation between BPPO, Onestep RL and iterative methods? Since BPPO is highly related to on-policy algorithms, it is naturally associated with Onestep RL (Brandfonbrener et al., 2021), which solves offline RL without off-policy evaluation. If we remove lines 8∼15 in Algorithm 1, we get the Onestep version of BPPO, which improves only the behavior policy $\hat{\pi}_\beta$.
In contrast, BPPO also improves $\pi_k$, the policy that has already been improved over $\hat{\pi}_\beta$. Figure 2 illustrates the difference between BPPO and its Onestep version: Onestep strictly requires the new policy to stay close to $\hat{\pi}_\beta$, while BPPO appropriately loosens this restriction. If we calculate the Q-function in an off-policy manner, namely line 13 in Algorithm 1, the method switches to an iterative style. If we adopt advantage replacement, line 11, BPPO estimates the advantage function only once but updates many policies, from $\hat{\pi}_\beta$ to $\pi_k$. Onestep RL estimates the Q-function once and uses it to update the estimated behavior policy; iterative methods estimate the Q-function several times and then update the corresponding policy. Strictly speaking, BPPO is neither an Onestep nor an iterative method; it is a special case between these two types.
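For reference, here is the clip ratio decay schedule of Equation (16) as a small Python sketch; $\epsilon_0 = 0.25$ and $\sigma = 0.96$ mirror the values reported in Section 7.3, while the function name and the printed steps are ours:

```python
def clip_ratio(i, eps0=0.25, sigma=0.96, decay_steps=200):
    """Eq. (16): eps_i = eps0 * sigma**i while i <= decay_steps, frozen afterwards."""
    return eps0 * sigma ** min(i, decay_steps)

# The surrogate then clips pi/pi_k to [1 - 2*eps_i, 1 + 2*eps_i] at step i.
for i in (0, 100, 200, 1000):
    print(i, round(clip_ratio(i), 4))
```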

6. RELATED WORK

Offline Reinforcement Learning. Most online off-policy methods fail or underperform in offline RL due to extrapolation error (Fujimoto et al., 2019) or distributional shift (Levine et al., 2020). Thus, most offline algorithms augment existing off-policy algorithms with a penalty measuring the divergence between the policy and the offline data (or behavior policy). Depending on how this penalty is implemented, a variety of methods have been proposed, such as batch constrained (Fujimoto et al., 2019), KL-control (Jaques et al., 2019; Liu et al., 2022b), behavior-regularized (Wu et al., 2019; Fujimoto & Gu, 2021) and policy constraint (Kumar et al., 2019; Levine et al., 2020; Kostrikov et al., 2021). Other methods augment BC with a weight that makes the policy favor high-advantage actions (Wang et al., 2018; Siegel et al., 2020; Peng et al., 2019; Wang et al., 2020). Still others introduce uncertainty estimation (An et al., 2021b; Bai et al., 2022) or conservative estimation (Kumar et al., 2020; Yu et al., 2021b; Nachum et al., 2019) to overcome overestimation.

Monotonic Policy Improvement. Monotonic policy improvement in online RL was first introduced by Kakade & Langford (2002). On this basis, two classical on-policy methods, Trust Region Policy Optimization (TRPO) (Schulman et al., 2015a) and Proximal Policy Optimization (PPO) (Schulman et al., 2017), were proposed. Afterwards, monotonic policy improvement was extended to constrained MDPs (Achiam et al., 2017), model-based methods (Luo et al., 2018) and off-policy RL (Queeney et al., 2021; Meng et al., 2021). The main idea behind BPPO is to regularize each policy update by restricting the divergence. Such regularization is often used in unsupervised skill learning (Liu et al., 2021; 2022a; Tian et al., 2021) and imitation learning (Xiao et al., 2019; Kang et al., 2021). Xu et al. (2021) mention that offline algorithms lack guaranteed performance improvement over the behavior policy, but we are the first to introduce monotonic policy improvement to solve offline RL.

7. EXPERIMENTS

We conduct a series of experiments on the Gym (v2), Adroit (v1), Kitchen (v0) and Antmaze (v2) tasks from D4RL (Fu et al., 2020) to evaluate the performance and analyze the design choices of Behavior Proximal Policy Optimization (BPPO). Specifically, we aim to answer: 1) How does BPPO compare with previous Onestep and iterative methods? 2) What is the superiority of BPPO over its Onestep and iterative versions? 3) What is the influence of the hyperparameters clip ratio ϵ and clip ratio decay σ?

7.1. RESULTS ON D4RL BENCHMARKS

We first compare BPPO with iterative methods, including CQL (Kumar et al., 2020) and TD3+BC (Fujimoto & Gu, 2021), and Onestep methods, including Onestep RL (Brandfonbrener et al., 2021) and IQL (Kostrikov et al., 2021). Most results of Onestep RL, IQL, CQL and TD3+BC are extracted from the IQL paper, and the results marked with * are reproduced by ourselves. Since BPPO first estimates a behavior policy and then improves it, we list the results of BC to the left of BPPO. From Table 1, we find that BPPO achieves comparable performance on each Gym task and slightly outperforms the baselines in total performance. On Adroit and Kitchen, BPPO prominently outperforms the other methods. Compared to BC, BPPO achieves a 51% performance improvement over all D4RL tasks. Interestingly, our implemented BC on Adroit and Kitchen nearly outperforms the baselines, which may imply that improving the behavior policy is better than learning a policy from scratch.

Next, we evaluate whether BPPO can solve more difficult tasks with sparse rewards. On the Antmaze tasks, we also compare BPPO with Decision Transformer (DT) (Chen et al., 2021), RvS-G and RvS-R (Emmons et al., 2021). DT conditions on past trajectories to predict future actions using a Transformer. RvS-G and RvS-R condition on goals or rewards to learn a policy via supervised learning. As shown in Table 2, BPPO outperforms on most tasks and is significantly better than the other algorithms in the total performance over all tasks. We adopt Filtered BC on the last four tasks, where only the successful trajectories are selected for behavior cloning. The performance of CQL and IQL is very impressive since no additional operation or information is introduced; RvS-G uses the goal to overcome the sparse-reward challenge. The superior performance demonstrates that BPPO can also considerably improve the policy learned by (Filtered) BC on tasks with sparse rewards.

7.2. THE SUPERIORITY OF BPPO OVER ITS ONESTEP AND ITERATIVE VERSIONS

BPPO v.s. Onestep BPPO. In Figure 3, we observe that both BPPO and Onestep BPPO outperform BC (the orange dotted line). This indicates that both of them achieve monotonic improvement over the behavior policy $\hat{\pi}_\beta$. Another important result is that BPPO is consistently better than Onestep BPPO, which demonstrates two key points: first, improving $\pi_k$ to fully utilize the information in the dataset is necessary; second, compared to strictly restricting the learned policy close to the behavior policy, appropriate looseness is useful.

BPPO v.s. iterative BPPO. When approximating the advantage $A_{\pi_k}$, we have two implementation choices. One is advantage replacement (line 11 in Algorithm 1). The other is off-policy Q-estimation (line 13 in Algorithm 1), corresponding to iterative BPPO. Both of them introduce extra error compared to the true $A_{\pi_k}$: the error of the former comes from the replacement $A_{\pi_k} \leftarrow A_{\hat{\pi}_\beta}$, while that of the latter comes from the off-policy estimation itself. We compare BPPO with iterative BPPO in Figure 4 and find that advantage replacement, namely BPPO, is clearly better.

Figure 4: The comparison between BPPO (the green curves) and its iterative versions, in which we update the Q network to approximate $Q_{\pi_k}$ instead of the $Q_{\hat{\pi}_\beta}$ used in BPPO. In particular, "BPPO$_{off=5}$" denotes that we update the Q network for 5 gradient steps per policy training step. Panels (x-axis: gradient steps; y-axis: normalized return): (a) halfcheetah-medium-replay, (b) walker2d-medium-replay, (c) halfcheetah-medium-expert, (d) walker2d-medium-expert.

7.3. ABLATION STUDY OF DIFFERENT HYPERPARAMETERS

In this section, we evaluate the influence of the clip ratio ϵ and its decay coefficient σ. The clip ratio restricts the learned policy to remain close to the behavior policy and thus directly addresses offline overestimation. Since ϵ also appears in online PPO, we can set it properly to avoid catastrophic performance, which is a unique feature of BPPO. The coefficient σ gradually tightens this restriction during policy improvement. We show how these coefficients contribute to the performance of BPPO; more ablations can be found in Appendices G, H, and I. First, we analyze five values of the clip ratio ϵ = (0.05, 0.1, 0.2, 0.25, 0.3). In most environments, such as hopper-medium-expert (Figure 5(b)), different ϵ show no significant difference, so we choose ϵ = 0.25; only on hopper-medium-replay is ϵ = 0.1 clearly better than the others. We then demonstrate how the clip ratio decay (σ = 0.90, 0.94, 0.96, 0.98, 1.00) affects the performance of BPPO. As shown in Figure 5(c), a low decay rate (σ = 0.90) or no decay (σ = 1.00) may cause a crash during training. We use σ = 0.96 to achieve stable policy improvement in all environments.

8. CONCLUSION

Behavior Proximal Policy Optimization (BPPO) starts from offline monotonic policy improvement and uses the loss function of PPO to elegantly solve offline RL without introducing any extra constraint or regularization. Theoretical derivations and extensive experiments show that the inherent conservatism of the on-policy method PPO is naturally suitable for overcoming overestimation in offline RL. BPPO is simple to implement and achieves superior performance on the D4RL dataset.

A PROOF OF PERFORMANCE DIFFERENCE THEOREM 1

Proof. First note that $A_\pi(s, a) = \mathbb{E}_{s' \sim p(s'|s,a)}\left[r(s, a) + \gamma V_\pi(s') - V_\pi(s)\right]$. Therefore,
$$\begin{aligned} \mathbb{E}_{\tau \sim P_{\pi'}}\left[\sum_{t=0}^{T} \gamma^t A_\pi(s_t, a_t)\right] &= \mathbb{E}_{\tau \sim P_{\pi'}}\left[\sum_{t=0}^{T} \gamma^t \left(r(s_t, a_t) + \gamma V_\pi(s_{t+1}) - V_\pi(s_t)\right)\right] \\ &= \mathbb{E}_{\tau \sim P_{\pi'}}\left[-V_\pi(s_0) + \sum_{t=0}^{T} \gamma^t r(s_t, a_t)\right] \\ &= -\mathbb{E}_{s_0}\left[V_\pi(s_0)\right] + \mathbb{E}_{\tau \sim P_{\pi'}}\left[\sum_{t=0}^{T} \gamma^t r(s_t, a_t)\right] \\ &= -J(\pi) + J(\pi') \triangleq J_\Delta(\pi', \pi). \end{aligned}$$
Now the first equality in Theorem 1 has been proved. For the second equality, we decompose the expectation over the trajectory into a sum of expectations over state-action pairs:
$$\begin{aligned} \mathbb{E}_{\tau \sim P_{\pi'}}\left[\sum_{t=0}^{T} \gamma^t A_\pi(s_t, a_t)\right] &= \sum_{t=0}^{T} \sum_s P(s_t = s|\pi')\, \mathbb{E}_{a \sim \pi'(\cdot|s)}\left[\gamma^t A_\pi(s, a)\right] \\ &= \sum_s \sum_{t=0}^{T} \gamma^t P(s_t = s|\pi')\, \mathbb{E}_{a \sim \pi'(\cdot|s)}\left[A_\pi(s, a)\right] \\ &= \sum_s \rho_{\pi'}(s)\, \mathbb{E}_{a \sim \pi'(\cdot|s)}\left[A_\pi(s, a)\right] \\ &= \mathbb{E}_{s \sim \rho_{\pi'}(s),\, a \sim \pi'(\cdot|s)}\left[A_\pi(s, a)\right]. \end{aligned}$$

B PROOF OF PROPOSITION 1

Proof. A state-action pair $(s_t, a_t) \in D$ can be viewed as a deterministic policy satisfying $\pi_D(a = a_t|s_t) = 1$ and $\pi_D(a \neq a_t|s_t) = 0$. So
$$\begin{aligned} D_{TV}(D \| \hat{\pi}_\beta)[s_t] &= D_{TV}(\pi_D \| \hat{\pi}_\beta)[s_t] = \frac{1}{2}\mathbb{E}_a\left|\pi_D(a|s_t) - \hat{\pi}_\beta(a|s_t)\right| \\ &= \frac{1}{2}\left[P(a_t)\left|\pi_D(a_t|s_t) - \hat{\pi}_\beta(a_t|s_t)\right| + P(a \neq a_t)\left|\pi_D(a|s_t) - \hat{\pi}_\beta(a|s_t)\right|\right] \\ &= \frac{1}{2}\left[P(a_t)\left(1 - \hat{\pi}_\beta(a_t|s_t)\right) + P(a \neq a_t)\, \hat{\pi}_\beta(a \neq a_t|s_t)\right] \\ &= \frac{1}{2}\left[P(a_t)\left(1 - \hat{\pi}_\beta(a_t|s_t)\right) + \left(1 - P(a_t)\right)\left(1 - \hat{\pi}_\beta(a_t|s_t)\right)\right] \\ &= \frac{1}{2}\left(1 - \hat{\pi}_\beta(a_t|s_t)\right). \end{aligned}$$

C PROOF OF THEOREM 2

Define $\bar{A}_{\pi, \hat{\pi}_\beta}(s) = \mathbb{E}_{a \sim \pi(\cdot|s)}\left[A_{\hat{\pi}_\beta}(s, a)\right]$. Note that the expectation of the advantage function $A_{\hat{\pi}_\beta}(s, a)$ is taken under another policy $\pi$ rather than $\hat{\pi}_\beta$, so $\bar{A}_{\pi, \hat{\pi}_\beta}(s) \neq 0$ in general. Given $\bar{A}_{\pi, \hat{\pi}_\beta}(s)$, the performance differences in Theorem 2 can be rewritten as
$$J_\Delta(\pi, \hat{\pi}_\beta) = \mathbb{E}_{s \sim \rho_\pi(\cdot),\, a \sim \pi(\cdot|s)}\left[A_{\hat{\pi}_\beta}(s, a)\right] = \mathbb{E}_{s \sim \rho_\pi(\cdot)}\left[\bar{A}_{\pi, \hat{\pi}_\beta}(s)\right],$$
$$\hat{J}_\Delta(\pi, \hat{\pi}_\beta) = \mathbb{E}_{s \sim \rho_D(\cdot),\, a \sim \pi(\cdot|s)}\left[A_{\hat{\pi}_\beta}(s, a)\right] = \mathbb{E}_{s \sim \rho_D(\cdot)}\left[\bar{A}_{\pi, \hat{\pi}_\beta}(s)\right].$$

Lemma 1. For all states $s$, $\left|\bar{A}_{\pi, \hat{\pi}_\beta}(s)\right| \leq 2 \max_a \left|A_{\hat{\pi}_\beta}(s, a)\right| \cdot D_{TV}(\pi \| \hat{\pi}_\beta)[s]$.

Proof. The expectation of the advantage function $A_\pi(s, a)$ over its own policy $\pi$ equals zero:
$$\mathbb{E}_{a \sim \pi}\left[A_\pi(s, a)\right] = \mathbb{E}_{a \sim \pi}\left[Q_\pi(s, a) - V_\pi(s)\right] = \mathbb{E}_{a \sim \pi}\left[Q_\pi(s, a)\right] - V_\pi(s) = 0.$$
Thus, with the help of Hölder's inequality, we get, for all $s$,
$$\left|\bar{A}_{\pi, \hat{\pi}_\beta}(s)\right| = \left|\mathbb{E}_{a \sim \pi(\cdot|s)}\left[A_{\hat{\pi}_\beta}(s, a)\right] - \mathbb{E}_{a \sim \hat{\pi}_\beta(\cdot|s)}\left[A_{\hat{\pi}_\beta}(s, a)\right]\right| \leq \left\|\pi(a|s) - \hat{\pi}_\beta(a|s)\right\|_1 \left\|A_{\hat{\pi}_\beta}(s, a)\right\|_\infty = 2 D_{TV}(\pi \| \hat{\pi}_\beta)[s] \cdot \max_a \left|A_{\hat{\pi}_\beta}(s, a)\right|.$$
Lemma 2. (Achiam et al., 2017) The divergence between two unnormalized visitation frequencies, $\|\rho_\pi(\cdot) - \rho_{\pi'}(\cdot)\|_1$, is bounded by an average total variational divergence of the policies $\pi$ and $\pi'$:
$$\left\|\rho_\pi(\cdot) - \rho_{\pi'}(\cdot)\right\|_1 \leq 2\gamma\, \mathbb{E}_{s \sim \rho_{\pi'}(\cdot)}\left[D_{TV}(\pi \| \pi')[s]\right].$$
Given this powerful lemma and the preparation above, we are now able to derive the bound of $|J_\Delta(\pi, \hat{\pi}_\beta) - \hat{J}_\Delta(\pi, \hat{\pi}_\beta)|$:
$$\begin{aligned} \left|J_\Delta(\pi, \hat{\pi}_\beta) - \hat{J}_\Delta(\pi, \hat{\pi}_\beta)\right| &= \left|\mathbb{E}_{s \sim \rho_\pi(\cdot)}\left[\bar{A}_{\pi, \hat{\pi}_\beta}(s)\right] - \mathbb{E}_{s \sim \rho_D(\cdot)}\left[\bar{A}_{\pi, \hat{\pi}_\beta}(s)\right]\right| \\ &= \left|\mathbb{E}_{s \sim \rho_\pi(\cdot)}\left[\bar{A}_{\pi, \hat{\pi}_\beta}(s)\right] - \mathbb{E}_{s \sim \rho_{\hat{\pi}_\beta}(\cdot)}\left[\bar{A}_{\pi, \hat{\pi}_\beta}(s)\right] + \mathbb{E}_{s \sim \rho_{\hat{\pi}_\beta}(\cdot)}\left[\bar{A}_{\pi, \hat{\pi}_\beta}(s)\right] - \mathbb{E}_{s \sim \rho_D(\cdot)}\left[\bar{A}_{\pi, \hat{\pi}_\beta}(s)\right]\right|. \end{aligned}$$
Based on Hölder's inequality and Lemma 2, we can bound the first difference as follows:
$$\left|\mathbb{E}_{s \sim \rho_\pi(\cdot)}\left[\bar{A}_{\pi, \hat{\pi}_\beta}(s)\right] - \mathbb{E}_{s \sim \rho_{\hat{\pi}_\beta}(\cdot)}\left[\bar{A}_{\pi, \hat{\pi}_\beta}(s)\right]\right| \leq \left\|\rho_\pi(\cdot) - \rho_{\hat{\pi}_\beta}(\cdot)\right\|_1 \left\|\bar{A}_{\pi, \hat{\pi}_\beta}(s)\right\|_\infty \leq 2\gamma\, \mathbb{E}_{s \sim \rho_{\hat{\pi}_\beta}(\cdot)}\left[D_{TV}(\pi \| \hat{\pi}_\beta)[s]\right] \cdot \max_s \left|\bar{A}_{\pi, \hat{\pi}_\beta}(s)\right|.$$
For the second difference we can derive a similar bound, and furthermore let $D_{TV}(D \| \hat{\pi}_\beta)[s_t] = \frac{1}{2}(1 - \hat{\pi}_\beta(a_t|s_t))$. Finally, using Lemma 1, we get
$$\begin{aligned} \left|J_\Delta(\pi, \hat{\pi}_\beta) - \hat{J}_\Delta(\pi, \hat{\pi}_\beta)\right| &\leq 2\gamma \max_s \left|\bar{A}_{\pi, \hat{\pi}_\beta}(s)\right| \left(\mathbb{E}_{s \sim \rho_{\hat{\pi}_\beta}(\cdot)}\left[D_{TV}(\pi \| \hat{\pi}_\beta)[s]\right] + \mathbb{E}_{s \sim \rho_D(\cdot)}\left[D_{TV}(D \| \hat{\pi}_\beta)[s]\right]\right) \\ &= 2\gamma \max_s \left|\bar{A}_{\pi, \hat{\pi}_\beta}(s)\right| \left(\mathbb{E}_{s \sim \rho_{\hat{\pi}_\beta}(\cdot)}\left[D_{TV}(\pi \| \hat{\pi}_\beta)[s]\right] + \mathbb{E}_{s \sim \rho_D(\cdot)}\left[\tfrac{1}{2}\left(1 - \hat{\pi}_\beta(a|s)\right)\right]\right) \\ &\leq 4\gamma \max_{s,a}\left|A_{\hat{\pi}_\beta}(s, a)\right| \cdot \max_s D_{TV}(\pi \| \hat{\pi}_\beta)[s] \cdot \left(\mathbb{E}_{s \sim \rho_{\hat{\pi}_\beta}(\cdot)}\left[D_{TV}(\pi \| \hat{\pi}_\beta)[s]\right] + \mathbb{E}_{s \sim \rho_D(\cdot)}\left[\tfrac{1}{2}\left(1 - \hat{\pi}_\beta(a|s)\right)\right]\right). \end{aligned}$$

D PROOF OF THEOREM 3

As an extension of Theorem 2, the proof of Theorem 3 is similar. Based on the same decomposition, we can directly derive the final bound:
$$\begin{aligned} \left|J_\Delta(\pi, \pi_k) - \hat{J}_\Delta(\pi, \pi_k)\right| &= \left|\mathbb{E}_{s \sim \rho_\pi(\cdot),\, a \sim \pi(\cdot|s)}\left[A_{\pi_k}(s, a)\right] - \mathbb{E}_{s \sim \rho_D(\cdot),\, a \sim \pi(\cdot|s)}\left[A_{\pi_k}(s, a)\right]\right| \\ &= \Big|\mathbb{E}_{s \sim \rho_\pi(\cdot)}\left[\bar{A}_{\pi, \pi_k}(s)\right] - \mathbb{E}_{s \sim \rho_{\pi_k}(\cdot)}\left[\bar{A}_{\pi, \pi_k}(s)\right] + \mathbb{E}_{s \sim \rho_{\pi_k}(\cdot)}\left[\bar{A}_{\pi, \pi_k}(s)\right] - \mathbb{E}_{s \sim \rho_{\hat{\pi}_\beta}(\cdot)}\left[\bar{A}_{\pi, \pi_k}(s)\right] \\ &\quad + \mathbb{E}_{s \sim \rho_{\hat{\pi}_\beta}(\cdot)}\left[\bar{A}_{\pi, \pi_k}(s)\right] - \mathbb{E}_{s \sim \rho_D(\cdot)}\left[\bar{A}_{\pi, \pi_k}(s)\right]\Big| \\ &\leq 2\gamma \max_s \left|\bar{A}_{\pi, \pi_k}(s)\right| \left(\mathbb{E}_{s \sim \rho_{\pi_k}(\cdot)}\left[D_{TV}(\pi \| \pi_k)[s]\right] + \mathbb{E}_{s \sim \rho_{\hat{\pi}_\beta}(\cdot)}\left[D_{TV}(\pi_k \| \hat{\pi}_\beta)[s]\right] + \mathbb{E}_{s \sim \rho_D(\cdot)}\left[\tfrac{1}{2}\left(1 - \hat{\pi}_\beta(a|s)\right)\right]\right) \\ &\leq 4\gamma \max_{s,a}\left|A_{\pi_k}(s, a)\right| \cdot \max_s D_{TV}(\pi \| \pi_k)[s] \cdot \left(\mathbb{E}_{s \sim \rho_{\pi_k}(\cdot)}\left[D_{TV}(\pi \| \pi_k)[s]\right] + \mathbb{E}_{s \sim \rho_{\hat{\pi}_\beta}(\cdot)}\left[D_{TV}(\pi_k \| \hat{\pi}_\beta)[s]\right] + \mathbb{E}_{s \sim \rho_D(\cdot)}\left[\tfrac{1}{2}\left(1 - \hat{\pi}_\beta(a|s)\right)\right]\right). \end{aligned}$$

E WHY IS GAE UNAVAILABLE IN THE OFFLINE SETTING?

In the traditional online situation, the advantage $A_{\pi_k}(s, a)$ is estimated by Generalized Advantage Estimation (GAE) (Schulman et al., 2015b) using data collected by policy $\pi_k$. But in offline RL, only the offline dataset $D = \{(s_t, a_t, s_{t+1}, r_t)\}_{t=1}^{N}$ from the true behavior policy $\pi_\beta$ is available. The advantage of $(s_t, a_t)$ calculated by GAE is
$$A_{\pi_\beta}(s_t, a_t) = \sum_{l=0}^{\infty} (\gamma\lambda)^l \left[r_{t+l} + \gamma V_{\pi_\beta}(s_{t+l+1}) - V_{\pi_\beta}(s_{t+l})\right].$$
GAE can only calculate the advantage of $(s_t, a_t) \in D$. For $(s_t, \tilde{a}_t)$, where $\tilde{a}_t$ is an in-distribution action sample but $(s_t, \tilde{a}_t) \notin D$, GAE is unable to give any estimate, because its calculation depends on the trajectory and cannot generalize to unseen state-action pairs. Therefore, GAE is not a satisfactory choice for offline RL. Offline RL forbids interaction with the environment, so data usage should be more efficient. Concretely, we expect the advantage approximation method to calculate the advantage not only of $(s_t, a_t)$ but also of $(s_t, \tilde{a}_t)$. As a result, we directly estimate the advantage by its definition $A_{\pi_\beta}(s, a) = Q_{\pi_\beta}(s, a) - V_{\pi_\beta}(s)$, where the Q-function is estimated by SARSA and the value function by fitting the returns $\sum_{t=0}^{T} \gamma^t r(s_t, a_t)$ with an MSE loss. This function-approximation method can generalize to the advantage of $(s_t, \tilde{a}_t)$.
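To make this alternative concrete, here is a minimal PyTorch sketch (ours; the batch fields and network call signatures are assumptions) of the SARSA-style Q regression and the Monte Carlo value regression described above:

```python
import torch
import torch.nn.functional as F

def sarsa_q_loss(q_net, batch, gamma=0.99):
    """One SARSA regression step for Q_{pi_beta}: the bootstrap target uses the
    next action *stored in the dataset*, so no OOD action is ever evaluated."""
    with torch.no_grad():
        target = batch["reward"] + gamma * (1.0 - batch["done"]) * \
                 q_net(batch["next_state"], batch["next_action"])
    return F.mse_loss(q_net(batch["state"], batch["action"]), target)

def v_loss(v_net, states, mc_returns):
    """Fit V_{pi_beta} by regressing the Monte Carlo returns sum_t gamma^t r_t."""
    return F.mse_loss(v_net(states), mc_returns)

# Unlike GAE, the fitted networks generalize: for any in-distribution action
# a_tilde, A(s, a_tilde) = q_net(s, a_tilde) - v_net(s) is well defined.
```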

F THEORETICAL ANALYSIS FOR ADVANTAGE REPLACEMENT

We choose to replace all $A_{\pi_k}$ with the trustworthy $A_{\hat{\pi}_\beta}$ and theoretically measure the difference, rather than empirically making the $A_{\pi_k}$ learned by Q-learning more accurate. The difference caused by replacing $A_{\pi_k}$ in $\hat{J}_\Delta(\pi, \pi_k)$ with $A_{\pi_\beta}(s, a)$ can be measured by the following theorem:

Theorem 4. Given the distance $D_{TV}(\pi_k \| \pi_\beta)[s]$ and assuming the reward function satisfies $|r(s, a)| \leq R_{\max}$ for all $s, a$, then
$$\left|\hat{J}_\Delta(\pi, \pi_k) - \mathbb{E}_{s \sim \rho_D(\cdot),\, a \sim \pi(\cdot|s)}\left[A_{\pi_\beta}(s, a)\right]\right| \leq 2\gamma(\gamma + 1) \cdot R_{\max} \cdot \mathbb{E}_{s \sim \rho_{\pi_\beta}(\cdot)}\left[D_{TV}(\pi_k \| \pi_\beta)[s]\right].$$

Proof. First note that $A_\pi(s, a) = \mathbb{E}_{s' \sim p(s'|s,a)}\left[r(s, a) + \gamma V_\pi(s') - V_\pi(s)\right]$. Then we have
$$\begin{aligned} &\left|\mathbb{E}_{s \sim \rho_D(\cdot),\, a \sim \pi(\cdot|s)}\left[A_{\pi_k}(s, a)\right] - \mathbb{E}_{s \sim \rho_D(\cdot),\, a \sim \pi(\cdot|s)}\left[A_{\pi_\beta}(s, a)\right]\right| \\ &= \left|\mathbb{E}_{s \sim \rho_D(\cdot),\, a \sim \pi(\cdot|s)}\,\mathbb{E}_{s' \sim p(s'|s,a)}\left[\gamma\left(V_{\pi_k}(s') - V_{\pi_\beta}(s')\right) - \left(V_{\pi_k}(s) - V_{\pi_\beta}(s)\right)\right]\right| \\ &\leq \mathbb{E}_{s \sim \rho_D(\cdot),\, a \sim \pi(\cdot|s)}\,\mathbb{E}_{s' \sim p(s'|s,a)}\left[\gamma\left|V_{\pi_k}(s') - V_{\pi_\beta}(s')\right| + \left|V_{\pi_k}(s) - V_{\pi_\beta}(s)\right|\right]. \end{aligned}$$
The value function can be rewritten as $V_\pi(s) = \mathbb{E}_{s \sim \rho_\pi(\cdot)}[r(s)]$. Then the difference between the two value functions can be measured using Hölder's inequality and Lemma 2:
$$\left|V_{\pi_k}(s) - V_{\pi_\beta}(s)\right| = \left|\mathbb{E}_{s \sim \rho_{\pi_k}(\cdot)}\left[r(s)\right] - \mathbb{E}_{s \sim \rho_{\pi_\beta}(\cdot)}\left[r(s)\right]\right| \leq \left\|\rho_{\pi_k}(\cdot) - \rho_{\pi_\beta}(\cdot)\right\|_1 \left\|r(s)\right\|_\infty \leq 2\gamma\, \mathbb{E}_{s \sim \rho_{\pi_\beta}(\cdot)}\left[D_{TV}(\pi_k \| \pi_\beta)[s]\right] \cdot \max_s |r(s)|.$$
Thus, the final bound is
$$\begin{aligned} &\left|\mathbb{E}_{s \sim \rho_D(\cdot),\, a \sim \pi(\cdot|s)}\left[A_{\pi_k}(s, a)\right] - \mathbb{E}_{s \sim \rho_D(\cdot),\, a \sim \pi(\cdot|s)}\left[A_{\pi_\beta}(s, a)\right]\right| \\ &\leq \mathbb{E}_{s \sim \rho_D(\cdot),\, a \sim \pi(\cdot|s)}\,\mathbb{E}_{s' \sim p(s'|s,a)}\left[2\gamma^2\, \mathbb{E}_{s' \sim \rho_{\pi_\beta}(\cdot)}\left[D_{TV}(\pi_k \| \pi_\beta)[s']\right] \cdot \max_{s'}|r(s')| + 2\gamma\, \mathbb{E}_{s \sim \rho_{\pi_\beta}(\cdot)}\left[D_{TV}(\pi_k \| \pi_\beta)[s]\right] \cdot \max_s |r(s)|\right] \\ &= 2\gamma(\gamma + 1)\, \max_s |r(s)|\, \mathbb{E}_{s \sim \rho_{\pi_\beta}(\cdot)}\left[D_{TV}(\pi_k \| \pi_\beta)[s]\right]. \end{aligned}$$
Note that the right-hand side is irrelevant to the policy $\pi$ and can be viewed as a constant when optimizing $\pi$. Combining the results of Theorems 3 and 4, we get the following corollary:

Corollary 1. Given the distances $D_{TV}(\pi \| \pi_k)[s]$, $D_{TV}(\pi_k \| \hat{\pi}_\beta)[s]$ and $D_{TV}(D \| \hat{\pi}_\beta)[s] = \frac{1}{2}(1 - \hat{\pi}_\beta(a|s))$, we can derive the following bound:
$$\begin{aligned} J_\Delta(\pi, \pi_k) \geq \mathbb{E}_{s \sim \rho_D(\cdot),\, a \sim \pi(\cdot|s)}\left[A_{\pi_\beta}(s, a)\right] &- 4\gamma A_{\pi_k} \max_s D_{TV}(\pi \| \pi_k)[s] \cdot \mathbb{E}_{s \sim \rho_{\pi_k}(\cdot)}\left[D_{TV}(\pi \| \pi_k)[s]\right] \\ &- 4\gamma A_{\pi_k} \max_s D_{TV}(\pi \| \pi_k)[s] \cdot \mathbb{E}_{s \sim \rho_{\hat{\pi}_\beta}(\cdot)}\left[D_{TV}(\pi_k \| \hat{\pi}_\beta)[s]\right] \\ &- 2\gamma A_{\pi_k} \max_s D_{TV}(\pi \| \pi_k)[s] \cdot \mathbb{E}_{s \sim \rho_D(\cdot)}\left[1 - \hat{\pi}_\beta(a|s)\right] - C_{\pi_k, \pi_\beta}, \end{aligned}$$
where $A_{\pi_k} = \max_{s,a}|A_{\pi_k}(s, a)|$ and $C_{\pi_k, \pi_\beta} = 2\gamma(\gamma + 1) \cdot \max_{s,a}|r(s, a)| \cdot \mathbb{E}_{s \sim \rho_{\pi_\beta}(\cdot)}\left[D_{TV}(\pi_k \| \pi_\beta)[s]\right]$.

Conclusion 3

To guarantee that the true objective $J_\Delta(\pi, \pi_k)$ is non-decreasing, we can also simultaneously maximize $\mathbb{E}_{s \sim \rho_D(\cdot),\, a \sim \pi(\cdot|s)}\left[A_{\pi_\beta}(s, a)\right]$ and minimize $\max_s D_{TV}(\pi \| \pi_k)[s]$, where $k = 0, 1, 2, \cdots$.

G ABLATION STUDY ON AN ASYMMETRIC COEFFICIENT

In this section, we give the details of the remaining hyperparameter selections in our experiments. In addition to the aforementioned clip ratio ϵ and its decay coefficient σ, we introduce an asymmetric coefficient $\omega \in (0, 1)$ that adjusts the advantage $\bar{A}_{\pi_\beta}$ depending on the sign of the advantage:
$$\bar{A}_{\pi_\beta} = \left|\omega - \mathbb{1}\left(A_{\pi_\beta} < 0\right)\right| A_{\pi_\beta}.$$
For $\omega > 0.5$, this downweights the contributions of state-action values $Q_{\pi_\beta}$ smaller than their expectation $V_{\pi_\beta}$ while distributing more weight to larger $Q_{\pi_\beta}$. We analyze three values of the asymmetric coefficient, $\omega = (0.5, 0.7, 0.9)$, in three Gym environments. Figure 6 shows that $\omega = 0.9$ is best for these tasks, especially hopper-medium-v2 and hopper-medium-replay-v2. With a larger $\omega$, policy improvement is guided in a better direction, leading to better performance in the Gym environments. Based on these results, we use the asymmetric advantage coefficient $\omega = 0.9$ for training on the Gym datasets and $\omega = 0.7$ for the Adroit, Antmaze, and Kitchen datasets, respectively.
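For clarity, the asymmetric reweighting can be written in a few lines; the sketch below is ours and simply instantiates the formula:

```python
import torch

def asymmetric_advantage(adv, omega=0.9):
    """|omega - 1(A < 0)| * A: weight omega where A >= 0 and 1 - omega where A < 0;
    omega = 0.5 recovers (up to a constant scale) uniform weighting."""
    weight = torch.full_like(adv, omega)
    weight[adv < 0] = 1.0 - omega
    return weight * adv

print(asymmetric_advantage(torch.tensor([2.0, -1.0])))  # tensor([ 1.8000, -0.1000])
```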

H IMPORTANCE RATIO DURING TRAINING

In this section, we examine whether the importance weight between the improved policy $\pi_k$ and the behavior policy $\pi_\beta$ can become arbitrarily large. To this end, we quantify this importance weight during the training phase in Figure 7. We observe that the ratio of BPPO with decay always stays in the clipped region (the region surrounded by the dotted yellow and red lines), whereas BPPO without decay moves beyond the region in Figures 7(a) and 7(b). This demonstrates that the policy improved without decay drifts farther away from the behavior policy than in the case of BPPO with decay, which may cause unstable performance and even a crash, as shown in Figures 5(c), 5(d) and 10 when σ = 1.00 (i.e., without decay).

Figure 7: Visualization of the importance weight between the updated policy and the behavior policy trained by BC. Whenever the performance of the policy improves, we calculate the importance weight (i.e., the probability ratio) between the improved policy and the behavior policy.
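A diagnostic like Figure 7 can be produced by logging the ratio on dataset samples during training. The sketch below is our instrumentation (reusing the GaussianPolicy interface from the Section 2.2 sketch), not code from the paper:

```python
import torch

@torch.no_grad()
def importance_ratio_stats(policy, behavior_policy, states, actions):
    """Monitor pi_k(a|s) / pi_beta_hat(a|s) on dataset samples, as in Figure 7."""
    lp = policy.dist(states).log_prob(actions).sum(-1)
    lp_beta = behavior_policy.dist(states).log_prob(actions).sum(-1)
    ratio = (lp - lp_beta).exp()
    return ratio.mean().item(), ratio.min().item(), ratio.max().item()
```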

J EXTRA COMPARISONS

In this section, we add EDAC (An et al., 2021a), LAPO (Chen et al., 2022), RORL (Yang et al., 2022), and ATAC (Cheng et al., 2022) as comparison baselines to further evaluate the superiority of BPPO. Although BPPO performs slightly worse than these SOTA methods on the Gym environments, it significantly outperforms all methods on the Adroit, Kitchen, and Antmaze datasets and has the best overall performance across all datasets.



Figure 1: Visualization of the importance weight between the updated policy $\pi_k$ and the estimated behavior policy $\hat{\pi}_\beta$.

Figure 2: The difference between Onestep BPPO (left) and BPPO (right), where the decreasing circle corresponds to ϵ decay.

Figure 3: The comparison between BPPO and Onestep BPPO. The hyperparameters of both methods are tuned through grid search, and we exhibit their learning curves with the best performance.

Figure 5: Ablation study on clip ratio ϵ (5(a), 5(b)) and clip ratio decay σ (5(c), 5(d)).

Figure 6: Ablation study on the coefficient ω. We optimize the hyperparameters through grid search, fix the other coefficients at their best-performing values, and vary the asymmetric coefficient to analyze how it affects BPPO. In particular, ω = 0.5 denotes training without the asymmetric coefficient (all advantages are weighted equally).


Note that the value function is $V_{\pi_\beta}$ rather than $V_{\pi_k}$, since the state distribution has been changed to $s \sim \rho_D(\cdot)$ in Theorems 2 and 3.

Table 1: The normalized results on D4RL Gym, Adroit, and Kitchen. We bold the best results; the BPPO score is calculated by averaging mean returns over 10 evaluation trajectories and five random seeds. The symbol * specifies that the results are reproduced by running the official open-source code.

Table 2: The normalized results on D4RL Antmaze tasks. The results of CQL and IQL are extracted from the IQL paper, while the others are extracted from the RvS paper. In the BC column, the symbol * specifies Filtered BC (Emmons et al., 2021), which removes the failed trajectories instead of performing standard BC.

Table 3: The normalized results of all algorithms on the Gym locomotion and Adroit datasets. The results of EDAC, RORL, and ATAC are extracted from their original articles.

Table 4: The normalized results of all algorithms on the Kitchen dataset. The results of LAPO are extracted from its original article.

Table 5: The normalized results of all algorithms on the Antmaze dataset. The results of RORL are extracted from its original article.

9. ACKNOWLEDGEMENTS

This work was supported by the National Science and Technology Innovation 2030 Major Project (Grant No. 2022ZD0208800) and the NSFC General Program (Grant No. 62176215). We sincerely thank Li He and Yachen Kang for helpful discussions and for polishing the writing.

CODE AVAILABILITY

Our code is available at https://github.com/

I COEFFICIENT PLOTS OF ONESTEP BPPO

In this section, we exhibit the learning curves and coefficient plots of Onestep BPPO. As shown in Figures 8 and 9, ϵ = 0.25 and ω = 0.9 are best for these tasks. Figure 10 shows how the clip coefficient decay affects the performance of Onestep BPPO. We observe that the curves without decay or with a low decay rate are unstable over three tasks and even crash during training on hopper-medium-replay-v2. Thus, we select σ = 0.96 to achieve stable policy improvement for Onestep BPPO. We use the coefficients with the best performance to compare with BPPO in Figure 3.

K IMPLEMENTATION AND EXPERIMENT DETAILS

Following the online PPO method, we use the tricks called 'code-level optimizations', including learning rate decay, orthogonal initialization, and normalization of the advantage in each mini-batch, which are considered very important to the success of the online PPO algorithm (Engstrom et al., 2020). We clip the concatenated gradient of all parameters such that the 'global L2 norm' does not exceed 0.5. We use a 2-layer MLP with 1024 hidden units for the Q and policy networks, and a 3-layer MLP with 512 hidden units for the value function V. Our method is implemented in PyTorch (Paszke et al., 2019). Next, we introduce the training details of Q, V, the (estimated) behavior policy $\hat{\pi}_\beta$, and the target policy π, respectively; a sketch of the network construction follows this list.

• Q and V network training: we run 2×10^6 gradient steps to fit the value functions Q and V, using a learning rate of 10^-4.
• (Estimated) behavior policy $\hat{\pi}_\beta$ training: we run 5×10^5 gradient steps for cloning $\hat{\pi}_\beta$, using a learning rate of 10^-4.
• Target policy π training: during policy improvement, we use learning rate decay, i.e., the learning rate decays at each interval step in the first 200 gradient steps and then remains fixed (decay rate σ = 0.96). We run 1,000 gradient steps of policy improvement for the Gym, Adroit, and Kitchen tasks and 100 gradient steps for the Antmaze tasks. The selections of the initial policy learning rate, initial clip ratio, and asymmetric coefficient are listed in Table 6.
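The network construction described above might look as follows; since "2 layers" vs. "2 hidden layers" is ambiguous in the text, the sketch assumes 2 and 3 hidden layers respectively, and the activation choice and environment dimensions are our assumptions:

```python
import torch.nn as nn

def mlp(sizes, act=nn.ReLU):
    """Simple fully connected stack; the ReLU activation is our assumption."""
    layers = []
    for i in range(len(sizes) - 1):
        layers.append(nn.Linear(sizes[i], sizes[i + 1]))
        if i < len(sizes) - 2:
            layers.append(act())
    return nn.Sequential(*layers)

state_dim, action_dim = 17, 6  # e.g., walker2d; dimensions are illustrative
q_net = mlp([state_dim + action_dim, 1024, 1024, 1])   # 2 hidden layers, 1024 units
v_net = mlp([state_dim, 512, 512, 512, 1])             # 3 hidden layers, 512 units
policy_net = mlp([state_dim, 1024, 1024, action_dim])  # mean head of the policy

# Global-norm gradient clipping at 0.5, as described above:
# torch.nn.utils.clip_grad_norm_(policy_net.parameters(), max_norm=0.5)
```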

