OFFLINE REINFORCEMENT LEARNING FROM HETEROSKEDASTIC DATA VIA SUPPORT CONSTRAINTS

Abstract

Offline reinforcement learning (RL) learns policies entirely from static datasets, thereby avoiding the challenges associated with online data collection. Practical applications of offline RL will inevitably require learning from datasets where the variability of demonstrated behaviors changes non-uniformly across the state space. For example, at a red light, nearly all human drivers behave similarly by stopping, but when merging onto a highway, some drivers merge quickly, efficiently, and safely, while many hesitate or merge dangerously. Both theoretically and empirically, we show that typical offline RL methods, which are based on distribution constraints, fail to learn from data with such non-uniform variability, due to the requirement to stay close to the behavior policy to the same extent across the state space. Ideally, the learned policy should be free to choose, per state, how closely to follow the behavior policy to maximize long-term return, as long as it stays within the support of the behavior policy. To instantiate this principle, we reweight the data distribution in conservative Q-learning (CQL) to obtain an approximate support constraint formulation. The reweighted distribution is a mixture of the current policy and an additional policy trained to mine poor actions that are likely under the behavior policy. Our method, CQL (ReDS), is simple, theoretically motivated, and improves performance across a wide range of offline RL problems in Atari games, navigation, and pixel-based manipulation.

1. INTRODUCTION

Recent advances in offline RL (Levine et al., 2020; Lange et al., 2012) hint at exciting possibilities in learning high-performing policies entirely from offline datasets, without requiring dangerous (Garcıa & Fernández, 2015) or expensive (Kalashnikov et al., 2018) active interaction with the environment. Analogously to the importance of data diversity in supervised learning (Deng et al., 2009), the practical benefits of offline RL depend heavily on the coverage of behavior in the offline datasets (Kumar et al., 2022). Intuitively, the dataset must illustrate the consequences of a diverse range of behaviors, so that an offline RL method can determine what behaviors lead to high returns, ideally returns that are significantly higher than the best single behavior in the dataset. We posit that combining many realistic sources of data can provide this kind of coverage, but doing so can lead to the variety of demonstrated behaviors varying in highly non-uniform ways across the state space, i.e., heteroskedastic datasets. For example, a dataset of humans driving cars might show very high variability in driving habits, with some drivers being timid and some more aggressive, but remain remarkably consistent in critical states (e.g., human drivers are extremely unlikely to swerve on an empty road or drive off a bridge). A good offline RL algorithm should combine the best parts of each behavior in the dataset: in the above example, the algorithm should produce a policy that is as good as the best human in each situation, which would be better than any human driver overall. At the same time, the learned policy should not attempt to extrapolate to novel actions in subsets of the state space where the distribution of demonstrated behaviors is narrow (e.g., the algorithm should not attempt to drive off a bridge). How effectively can current offline RL methods selectively choose, on a per-state basis, how closely to stick to the behavior policy?
Most existing methods (Kumar et al., 2019; 2020; Kostrikov et al., 2021b; a; Wu et al., 2019; Fujimoto et al., 2018a; Jaques et al., 2019) constrain the learned policy to stay close to the behavior policy, so-called "distributional constraints". Our first contribution consists of empirical and theoretical evidence demonstrating that distributional constraints are insufficient when the variability of the demonstrated behaviors changes non-uniformly across states, because the strength of the constraint is state-agnostic: it may be overly conservative at some states even while it is not conservative enough at others. We also devise a measure of heteroskedasticity that enables us to determine whether a given offline dataset will be challenging for distributional constraints. Our second contribution is a simple and theoretically motivated observation: distribution constraints against a reweighted version of the behavior policy give rise to support constraints. That is, the return-maximization process can freely choose, state by state, how closely the learned policy should stay to the behavior policy, so long as the learned policy remains within the support of the data. We show that it is particularly convenient to instantiate this insight on top of conservative Q-learning (CQL) (Kumar et al., 2020), a recent offline RL method. The new method, CQL (ReDS), makes only minimal changes to the form of CQL's regularization, retains its design decisions, and inherits existing hyperparameter values. CQL (ReDS) attains better performance than recent distribution constraint methods on a variety of tasks with heteroskedastic data distributions.
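To make the reweighting idea concrete, here is a minimal sketch of how such a per-state mixture distribution could be formed for tabular policies. All names and the fixed mixing weight are our own illustration, not the paper's implementation; in particular, the training of the auxiliary "poor-action mining" policy is not shown.

```python
import numpy as np

def reds_push_down_distribution(pi, rho, mix=0.5):
    """Illustrative sketch of the ReDS-style reweighting: mix the current
    policy `pi` with an auxiliary policy `rho` that (trained elsewhere)
    concentrates on poor actions still likely under the behavior policy.
    Both arrays have shape (num_states, num_actions) with rows summing to
    one, so the mixture is again a valid per-state action distribution.
    """
    return mix * pi + (1.0 - mix) * rho
```

Because each row of the result remains a probability distribution, the mixture can be substituted wherever the original algorithm samples actions from the current policy.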

2. PRELIMINARIES ON DISTRIBUTIONAL CONSTRAINTS OFFLINE RL

The goal in offline RL is to find the optimal policy in a Markov decision process (MDP) specified by the tuple M = (S, A, T, r, µ_0, γ). S and A denote the state and action spaces, T(s′|s, a) and r(s, a) represent the dynamics and reward function, µ_0(s) denotes the initial state distribution, and γ ∈ (0, 1) denotes the discount factor. The goal is to learn a policy that maximizes the return, denoted by J(π) := 1/(1-γ) E_{(s_t, a_t)∼π}[Σ_t γ^t r(s_t, a_t)]. We must find the best possible policy while only having access to an offline dataset of transitions, D = {(s, a, r, s′)}, collected using a behavior policy π_β. Offline RL via distributional constraints. Most offline RL algorithms regularize the learned policy π to keep it from querying the target Q-function on unseen actions (Fujimoto et al., 2018a; Kumar et al., 2019), either implicitly or explicitly. For our theoretical analysis, we abstract the behavior of distributional-constraint offline RL algorithms into a generic formulation following Kumar et al. (2020). As shown in Equation 1, we consider the problem of maximizing the return of the learned policy π in the empirical MDP, J(π), while also penalizing the divergence from π_β: max_π J(π) - α E_{s∼d^π}[D(π, π_β)(s)] (generic distributional constraint) (1), where D denotes a divergence between the learned policy π and the behavior policy π_β at state s.
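As a concrete reading of Equation 1, the sketch below evaluates the penalized objective for tabular policies, choosing D to be the KL divergence. All function and variable names are ours, and the return estimate is taken as given rather than computed from the MDP.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) between two discrete action distributions at one state.
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def constrained_objective(return_estimate, pi, pi_beta, state_weights, alpha):
    """Generic distributional-constraint objective (Equation 1):
    estimated return J(pi) minus an alpha-weighted divergence penalty,
    averaged over the state-visitation distribution d^pi.

    pi, pi_beta:    arrays of shape (num_states, num_actions), rows sum to 1.
    state_weights:  visitation distribution over states, length num_states.
    """
    penalty = sum(w * kl_divergence(pi[s], pi_beta[s])
                  for s, w in enumerate(state_weights))
    return return_estimate - alpha * penalty
```

Note that the penalty weight α is a single scalar shared by all states, which is precisely the state-agnostic property that the paper argues fails on heteroskedastic data.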



Figure 1: Failure mode of distributional constraints. In this navigation task, an offline RL algorithm must find a path from the start state to the goal state as indicated in (a). The offline dataset exhibits non-uniform coverage at different states: the state marked "B", located in a wide room, has a more uniform action distribution, whereas the states in the narrow hallways exhibit narrower action distributions. This is akin to how the behavior of human drivers varies in certain locations ("B") but is very similar in other situations ("A"). To perform well, an algorithm must stay close to the data in the hallways ("A"), but deviate significantly from the data in the rooms ("B"), where the data supports many different behaviors (most of them not good). AWR and CQL become stuck because they stay too close to the bad behavior policy in the rooms, e.g., the left and right arrows near state B in (b) and (c). Our method, CQL (ReDS), learns to ignore the bad behavior-policy actions in state B and prioritizes the good action, indicated by the downward arrow near state B in (d).

Conservative Q-learning. CQL (Kumar et al., 2020) enforces the distributional constraint on the policy implicitly. To see why this is the case, consider the CQL objective, which consists of two terms: min_θ α (E_{s∼D, a∼π}[Q_θ(s, a)] - E_{(s,a)∼D}[Q_θ(s, a)]) + 1/2 E_{(s,a,s′)∼D}[(Q_θ(s, a) - B^π Q̄(s, a))²], (2) where B^π Q̄(s, a) is the Bellman backup operator applied to a delayed target Q-network Q̄: B^π Q̄(s, a) := r(s, a) + γ E_{a′∼π(a′|s′)}[Q̄(s′, a′)]. The first term, R(θ) (in red), attempts to prevent overestimation of the Q-values for out-of-distribution (OOD) actions, and the second term (in blue) is the standard TD error (Lillicrap et al., 2015; Fujimoto et al., 2018b; Haarnoja et al., 2018).
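For intuition, the two terms of Equation 2 can be written out for a tabular Q-function as below. This is a sketch under our own naming; real implementations use neural-network Q-functions and stochastic gradients rather than an exact tabular loss.

```python
import numpy as np

def cql_loss(q, q_target, batch, pi, alpha, gamma=0.99):
    """Sketch of the two CQL terms (Equation 2) for a tabular Q-function.

    q, q_target: arrays of shape (num_states, num_actions);
                 q_target is the delayed target copy.
    batch:       list of (s, a, r, s2) transitions from the offline dataset D.
    pi:          current policy, shape (num_states, num_actions).
    """
    reg, td = 0.0, 0.0
    for s, a, r, s2 in batch:
        # Regularizer R(theta): push down Q under the learned policy pi,
        # push up Q on actions actually present in the dataset.
        reg += np.dot(pi[s], q[s]) - q[s, a]
        # Standard TD error against the Bellman backup of the target network.
        backup = r + gamma * np.dot(pi[s2], q_target[s2])
        td += 0.5 * (q[s, a] - backup) ** 2
    n = len(batch)
    return alpha * reg / n + td / n
```

Because the regularizer pushes down Q-values in proportion to how much probability π places on each action, the policy is implicitly discouraged from placing mass on actions absent from the dataset, which is how CQL realizes a distributional constraint without an explicit divergence term.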

