BEHAVIOR PROXIMAL POLICY OPTIMIZATION

Abstract

Offline reinforcement learning (RL) is a challenging setting where existing off-policy actor-critic methods perform poorly because they overestimate the value of out-of-distribution state-action pairs. Thus, various additional augmentations have been proposed to keep the learned policy close to the offline dataset (or the behavior policy). In this work, starting from an analysis of offline monotonic policy improvement, we reach a surprising conclusion: online on-policy algorithms are naturally able to solve offline RL. Specifically, the inherent conservatism of these on-policy algorithms is exactly what offline RL needs to overcome overestimation. Based on this, we propose Behavior Proximal Policy Optimization (BPPO), which solves offline RL without introducing any extra constraint or regularization compared to PPO. Extensive experiments on the D4RL benchmark empirically show that this extremely succinct method outperforms state-of-the-art offline RL algorithms. Our implementation is available at https://github.com/.

1. INTRODUCTION

Typically, reinforcement learning (RL) is thought of as a paradigm for online learning, where the agent interacts with the environment to collect experiences and then uses them to improve itself (Sutton et al., 1998). This online process poses the biggest obstacle to real-world RL applications because data collection is expensive or even risky in some fields (such as navigation (Mirowski et al., 2018) and healthcare (Yu et al., 2021a)). As an alternative, offline RL eliminates online interaction and learns from a fixed dataset collected by some arbitrary and possibly unknown process (Lange et al., 2012; Fu et al., 2020). The prospect of this data-driven mode (Levine et al., 2020) is quite encouraging, and great expectations have been placed on it for solving real-world RL applications. Unfortunately, the major advantage of offline RL, the lack of online interaction, also raises another challenge. Although offline RL can be viewed as an extreme off-policy case, classical off-policy iterative algorithms tend to underperform because they overestimate out-of-distribution (OOD) state-action pairs. More specifically, when the Q-function poorly estimates the value of OOD state-action pairs during policy evaluation, the agent tends to take OOD actions with erroneously high estimated values, resulting in low performance after policy improvement (Fujimoto et al., 2019). Thus, to overcome the overestimation issue, some solutions keep the learned policy close to the behavior policy (or the offline dataset) (Fujimoto et al., 2019; Wu et al., 2019; Fujimoto & Gu, 2021). Most offline RL algorithms adopt online interactions to select hyperparameters, because offline hyperparameter selection, which selects hyperparameters without online interactions, remains an open problem lacking satisfactory solutions (Paine et al., 2020; Zhang & Jiang, 2021).
Deploying a policy learned by offline RL is potentially risky in certain areas (Mirowski et al., 2018; Yu et al., 2021a) since its performance is unknown. However, the risk during online interaction is greatly reduced if the deployed policy can guarantee better performance than the behavior policy. This inspires us to consider how to use the offline dataset to improve the behavior policy with a monotonic performance guarantee. We formulate this problem as offline monotonic policy improvement. To analyze it, we introduce the Performance Difference Theorem (Kakade & Langford, 2002). During the analysis, we find that the offline setting does make monotonic policy improvement more complicated, but the way to monotonically improve the policy remains unchanged. This indicates that algorithms derived from online monotonic policy improvement, such as Proximal Policy Optimization (PPO), can also achieve offline monotonic policy improvement; in other words, PPO can naturally solve offline RL. Based on this surprising discovery, we propose Behavior Proximal Policy Optimization (BPPO), an offline algorithm that monotonically improves the behavior policy in the manner of PPO. Owing to the inherent conservatism of PPO, BPPO restricts the ratio between the learned policy and the behavior policy to a certain range, similar to offline RL methods that keep the learned policy close to the behavior policy. As offline algorithms become more and more sophisticated, TD3+BC (Fujimoto & Gu, 2021), which augments TD3 (Fujimoto et al., 2018) with behavior cloning (Pomerleau, 1988), reminds us to revisit simple alternatives with potentially good performance. BPPO is such a "most simple" alternative, introducing no extra constraint or regularization on top of PPO. Extensive experiments on the D4RL benchmark (Fu et al., 2020) empirically show that BPPO outperforms state-of-the-art offline RL algorithms.
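To make the "inherent conservatism" concrete, the following is a minimal numpy sketch of a PPO-style clipped surrogate objective, where the ratio is taken with respect to the behavior policy. The function name, inputs, and clip range are illustrative assumptions, not the paper's implementation: once the ratio leaves $[1-\epsilon, 1+\epsilon]$, the objective stops rewarding further deviation, which is what keeps the learned policy near the behavior policy.

```python
import numpy as np

def clipped_surrogate(ratio, adv, epsilon=0.2):
    """PPO-style clipped surrogate objective (to be maximized).

    ratio:   pi_theta(a|s) / pi_behavior(a|s) for sampled state-action pairs
    adv:     advantage estimates under the (estimated) behavior policy
    epsilon: clip range; ratios outside [1 - eps, 1 + eps] give no extra gain
    """
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * adv
    # The elementwise minimum is a pessimistic (conservative) lower bound
    # on the unclipped objective, discouraging large policy deviations.
    return np.mean(np.minimum(unclipped, clipped))

# Illustrative values: the first ratio (1.5, positive advantage) is clipped
# to 1.2, so pushing it further yields no additional objective value.
ratio = np.array([1.5, 1.0, 0.5])
adv = np.array([1.0, 1.0, -1.0])
print(clipped_surrogate(ratio, adv))
```

Note that for the negative-advantage sample the minimum selects the clipped term $0.8 \cdot (-1)$ rather than $0.5 \cdot (-1)$, again the more pessimistic of the two.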

2. PRELIMINARIES

2.1. REINFORCEMENT LEARNING

Reinforcement Learning (RL) is a framework for sequential decision making. Typically, this problem is formulated as a Markov decision process (MDP) $\mathcal{M} = \{\mathcal{S}, \mathcal{A}, r, p, d_0, \gamma\}$, with state space $\mathcal{S}$, action space $\mathcal{A}$, scalar reward function $r$, transition dynamics $p$, initial state distribution $d_0(s_0)$ and discount factor $\gamma$ (Sutton et al., 1998). The objective of RL is to learn a policy $\pi(a_t|s_t)$, which defines a distribution over actions conditioned on states at timestep $t$, where $a_t \in \mathcal{A}$, $s_t \in \mathcal{S}$. Given this definition, the trajectory $\tau = (s_0, a_0, \cdots, s_T, a_T)$ generated by the agent's interaction with the environment $\mathcal{M}$ follows the distribution $P_\pi(\tau) = d_0(s_0) \prod_{t=0}^{T} \pi(a_t|s_t)\, p(s_{t+1}|s_t, a_t)$, where $T$ is the length of the trajectory, which can be infinite. Then, the goal of RL can be written as an expectation under the trajectory distribution: $J(\pi) = \mathbb{E}_{\tau \sim P_\pi(\tau)}\left[\sum_{t=0}^{T} \gamma^t r(s_t, a_t)\right]$. This objective can also be measured by a state-action value function $Q^{\pi}(s, a)$, the expected discounted return given the action $a$ in state $s$: $Q^{\pi}(s, a) = \mathbb{E}_{\tau \sim P_\pi(\tau|s,a)}\left[\sum_{t=0}^{T} \gamma^t r(s_t, a_t) \mid s_0 = s, a_0 = a\right]$. Similarly, the value function $V^{\pi}(s)$ is the expected discounted return from a certain state $s$: $V^{\pi}(s) = \mathbb{E}_{\tau \sim P_\pi(\tau|s)}\left[\sum_{t=0}^{T} \gamma^t r(s_t, a_t) \mid s_0 = s\right]$. Then, we can define the advantage function $A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)$.
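For a tabular MDP, these quantities can be computed exactly rather than by sampling, which makes the definitions easy to check numerically. The sketch below uses a tiny hypothetical 2-state, 2-action MDP (the numbers are arbitrary, chosen only for illustration) and solves the Bellman equation $V^{\pi} = r_\pi + \gamma P_\pi V^{\pi}$ as a linear system:

```python
import numpy as np

# A tiny tabular MDP (hypothetical numbers, for illustration only):
# 2 states, 2 actions, infinite horizon with discount gamma.
gamma = 0.9
P = np.array([  # P[s, a, s'] = p(s' | s, a)
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.3, 0.7]],
])
R = np.array([  # R[s, a] = r(s, a)
    [1.0, 0.0],
    [0.0, 2.0],
])
pi = np.array([  # pi[s, a] = pi(a | s)
    [0.6, 0.4],
    [0.3, 0.7],
])

# State-transition matrix and expected reward under pi.
P_pi = np.einsum("sa,sap->sp", pi, P)
r_pi = np.einsum("sa,sa->s", pi, R)

# V^pi solves the Bellman equation V = r_pi + gamma * P_pi V exactly.
V = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Q^pi(s, a) = r(s, a) + gamma * sum_s' p(s'|s,a) V^pi(s'),
# and the advantage is their difference.
Q = R + gamma * np.einsum("sap,p->sa", P, V)
A = Q - V[:, None]

# Sanity check: the advantage averages to zero under the policy itself,
# since V^pi(s) = E_{a~pi}[Q^pi(s, a)].
assert np.allclose(np.einsum("sa,sa->s", pi, A), 0.0)
```

The final assertion reflects the definition $A^{\pi} = Q^{\pi} - V^{\pi}$: averaging the advantage over the policy's own action distribution recovers zero.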

2.2. OFFLINE REINFORCEMENT LEARNING

In offline RL, the agent only has access to a fixed dataset of transitions $\mathcal{D} = \{(s_t, a_t, s_{t+1}, r_t)\}_{t=1}^{N}$ collected by a behavior policy $\pi_\beta$. Without interacting with the environment $\mathcal{M}$, offline RL expects the agent to infer a policy from the dataset. Behavior cloning (BC) (Pomerleau, 1988), an imitation learning approach, directly imitates the action taken at each state with supervised learning: $\hat{\pi}_\beta = \arg\max_\pi \mathbb{E}_{(s,a)\sim\mathcal{D}}\left[\log \pi(a|s)\right]$. Note that the performance of $\hat{\pi}_\beta$ trained by behavior cloning highly depends on the quality of the transitions, and thus on the collection process of the behavior policy $\pi_\beta$. In the rest of this paper, improving the behavior policy actually refers to improving the estimated behavior policy $\hat{\pi}_\beta$, because $\pi_\beta$ is unknown.

2.3. PERFORMANCE DIFFERENCE THEOREM

Theorem 1. (Kakade & Langford, 2002) Define the discounted unnormalized visitation frequencies as $\rho_\pi(s) = \sum_{t=0}^{T} \gamma^t P(s_t = s|\pi)$, where $P(s_t = s|\pi)$ represents the probability that the $t$-th state equals $s$ in trajectories generated by policy $\pi$. For any two policies $\pi$ and $\pi'$, the performance difference $J_\Delta(\pi', \pi) \triangleq J(\pi') - J(\pi)$ can be measured by the advantage function:
$$J_\Delta(\pi', \pi) = \mathbb{E}_{\tau \sim P_{\pi'}(\tau)}\left[\sum_{t=0}^{T} \gamma^t A^{\pi}(s_t, a_t)\right] = \mathbb{E}_{s \sim \rho_{\pi'}(\cdot),\, a \sim \pi'(\cdot|s)}\left[A^{\pi}(s, a)\right]. \quad (2)$$
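The identity in Theorem 1 can be verified numerically on any tabular MDP: the performance gap between two arbitrary policies equals the advantages of the old policy weighted by the new policy's discounted visitation frequencies. The sketch below does this exactly with linear solves, for a small randomly generated MDP (all numbers are illustrative; the identity itself holds for any MDP):

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 3, 2, 0.9

# A random tabular MDP (illustrative; the identity holds for any MDP).
P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((nS, nA))
d0 = np.full(nS, 1.0 / nS)

def random_policy():
    pi = rng.random((nS, nA))
    return pi / pi.sum(axis=1, keepdims=True)

def value(pi):
    P_pi = np.einsum("sa,sap->sp", pi, P)
    r_pi = np.einsum("sa,sa->s", pi, R)
    return np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)

pi, pi_new = random_policy(), random_policy()

# Left-hand side: J(pi') - J(pi), with J(pi) = E_{s0 ~ d0}[V^pi(s0)].
lhs = d0 @ value(pi_new) - d0 @ value(pi)

# Right-hand side: advantages of pi, weighted by the discounted
# unnormalized visitation frequencies rho_{pi'} of the new policy.
V = value(pi)
A = R + gamma * np.einsum("sap,p->sa", P, V) - V[:, None]
P_new = np.einsum("sa,sap->sp", pi_new, P)
rho = d0 @ np.linalg.inv(np.eye(nS) - gamma * P_new)  # rho_{pi'}(s)
rhs = np.einsum("s,sa,sa->", rho, pi_new, A)

assert np.isclose(lhs, rhs)  # the performance difference identity holds
```

Here $\rho_{\pi'}$ is computed in closed form as $d_0^\top (I - \gamma P_{\pi'})^{-1}$, the infinite-horizon analogue of the sum in Theorem 1.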


