DROP: CONSERVATIVE MODEL-BASED OPTIMIZATION FOR OFFLINE REINFORCEMENT LEARNING

Anonymous

Abstract

In this work, we decouple the two levels of iterative (bi-level) offline RL from the offline training phase, forming a non-iterative bi-level paradigm that avoids iterative error propagation across the two levels. Specifically, this non-iterative paradigm allows us to conduct inner-level optimization during training (i.e., employing policy/value regularization), while performing outer-level optimization during testing (i.e., conducting policy inference). Naturally, such a paradigm raises three core questions (that are not fully answered by prior non-iterative offline RL counterparts such as reward-conditioned policies): (Q1) What information should we transfer from the inner level to the outer level? (Q2) What should we pay attention to when exploiting the transferred information for the outer-level optimization? (Q3) What are the benefits of concurrently conducting outer-level optimization during testing? Motivated by model-based optimization (MBO), we propose DROP, which fully answers these three questions. Specifically, in the inner level, DROP decomposes the offline data into multiple subsets and learns an MBO score model (A1). To ensure safe exploitation of the score model in the outer level, we explicitly learn a behavior embedding and introduce a conservative regularization (A2). During testing, we show that DROP permits deployment adaptation, enabling adaptive inference across states (A3). Empirically, we evaluate DROP on various tasks, showing that it achieves comparable or better performance than prior methods.

1. INTRODUCTION

Offline reinforcement learning (RL) (Lange et al., 2012; Levine et al., 2020) describes the task of learning a policy from previously collected static data. Due to the overestimation of values at out-of-distribution (OOD) state-actions, recent iterative offline RL methods introduce various policy/value regularizations to avoid deviating from the offline data distribution (or support) during the training phase. These methods then directly deploy the learned policy in an online environment to test its performance. To unfold our following analysis, we term this learning procedure iterative bi-level offline RL (Figure 1, left), wherein the inner-level optimization tries to eliminate the OOD issue by constraining the policy/value function, and the outer-level optimization tries to learn a better policy that will be deployed at test time. Here, we use the term "iterative" to emphasize that the inner level and outer level are iteratively optimized in the training phase. However, without enough inner-level optimization (OOD regularization), there remains a distribution shift between the behavior policy and the policy to be evaluated. Further, due to iterative error exploitation and propagation (Brandfonbrener et al., 2021) across the two levels, performing such iterative bi-level optimization entirely in training often struggles to learn a stable policy/value function. In this work, we thus advocate for non-iterative bi-level optimization (Figure 1, right). Intriguingly, prior works under such a non-iterative framework have proposed to transfer (as " " in Q1) filtered trajectories (Chen et al., 2021), a reward-conditioned policy (Emmons et al., 2021; Kumar et al., 2019b), or the Q-value estimate of the behavior policy (Brandfonbrener et al., 2021; Gulcehre et al., 2021), all of which, however, only partially address the aforementioned questions (we elaborate on these works in Table 1). In this work, we propose a new alternative method that transfers an embedding-conditioned (Q-value) score model, and we will show that this method sufficiently answers the above questions and benefits most from the non-iterative framework.
Before introducing our method, we introduce a conceptually similar task (to non-iterative bi-level optimization): offline model-based optimization (MBO, Trabucco et al. (2021)), which aims to discover, from static input-score pairs, a new design input that attains the highest score. Typically, offline MBO first learns a score model that maps an input to its score via supervised regression (corresponding to inner-level optimization), and then performs inference with the learned score model (as " "), for instance by optimizing the input against the learned score model via gradient ascent (corresponding to the outer level). To enable this MBO implementation in offline RL, we are required to decompose an offline RL task into multiple sub-tasks, each of which thus corresponds to a behavior policy-return (parameters-return) pair. However, practical optimization difficulties arise when learning the score model (inner level) and performing inference (outer level) over the high-dimensional policy parameter space (the input to the score model). At inference, directly extrapolating the learned score model (" ") also tends to drive the high-dimensional candidate policy (parameters) towards out-of-distribution, invalid, and low-scoring parameters (Kumar & Levine, 2020), as these are falsely and over-optimistically scored by the learned score model. To tackle these problems, we suggest (A1) learning low-dimensional embeddings for the sub-tasks decomposed in the MBO implementation, over which we estimate an embedding-conditioned Q-value as the MBO score model (" " in Q1), and (A2) introducing a conservative regularization, which pushes down the predicted scores on OOD embeddings, so as to avoid over-optimistic exploitation and protect against producing low-confidence embeddings when conducting outer-level optimization (policy/embedding inference).
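To make the outer-level procedure concrete, below is a minimal numerical sketch (not the paper's implementation) of gradient-ascent inference against a learned score model with a conservative penalty. The quadratic `score`, the penalty centered at the origin (standing in for the behavior-embedding support), and the weight `ALPHA` are all illustrative assumptions.

```python
import numpy as np

# Toy sketch of the outer level (illustrative; not the paper's code).
# z_opt is the optimum implicitly encoded by the fitted score model;
# ALPHA is a hypothetical conservatism weight.
z_opt = np.array([1.0, -0.5])
ALPHA = 0.25

def score(z):
    # Stand-in for a learned score model: a quadratic peaking at z_opt.
    return -np.sum((z - z_opt) ** 2)

def conservative_score(z):
    # Conservative objective: raw score minus a penalty that pushes down
    # predictions far from the behavior-embedding support (origin here).
    return score(z) - ALPHA * np.sum(z ** 2)

def grad_conservative_score(z):
    # Analytic gradient of the conservative objective above.
    return -2.0 * (z - z_opt) - 2.0 * ALPHA * z

def infer_embedding(z_init, lr=0.1, steps=300):
    """Outer-level optimization: gradient ascent on the conservative score."""
    z = np.array(z_init, dtype=float)
    for _ in range(steps):
        z = z + lr * grad_conservative_score(z)
    return z

z_hat = infer_embedding([0.0, 0.0])
# The penalty biases the solution back toward the behavior support: the
# fixed point is z_opt / (1 + ALPHA) rather than z_opt itself.
```

Without the penalty (ALPHA = 0), the ascent would reach z_opt exactly; with it, the inferred embedding stays closer to the data, mirroring how the conservative regularization in A2 guards against over-optimistic extrapolation.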
Meanwhile, (A3) learning embeddings permits deployment adaptation, which means we can dynamically adjust the inferred embedding across different states at test time (aka test-time adaptation). We name our method DROP (Design fROm Policies). Compared with standard offline MBO for parameter design (Trabucco et al., 2021), deployment adaptation in DROP leverages the MDP structure of RL tasks, rather than simply conducting inference once at the beginning of the test rollout. Empirically, we demonstrate that DROP can effectively extrapolate a better policy, benefiting from the non-iterative framework by answering the above three questions, and achieves comparable or better performance than many prior offline RL algorithms.
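Deployment adaptation can likewise be sketched in toy form: rather than inferring a single embedding at the start of the rollout, the embedding is re-optimized at every visited state. The state-dependent score `q_toy` and its `tanh` optimum are hypothetical stand-ins for the learned embedding-conditioned Q-value.

```python
import numpy as np

# Toy sketch of deployment adaptation (illustrative; not the paper's code).
# q_toy stands in for the learned embedding-conditioned Q-value; its
# maximizer depends on the current state via a hypothetical tanh target.
def q_toy(state, z):
    return -np.sum((z - np.tanh(state)) ** 2)

def grad_q_toy(state, z):
    return -2.0 * (z - np.tanh(state))

def adapt_embedding(state, z, lr=0.2, steps=100):
    # Re-run the outer-level gradient ascent, warm-started from the
    # previous state's embedding.
    for _ in range(steps):
        z = z + lr * grad_q_toy(state, z)
    return z

# At each visited state, the embedding is re-inferred (test-time
# adaptation), instead of being fixed at the beginning of the rollout.
states = [np.array([0.0]), np.array([2.0])]
z = np.zeros(1)
history = []
for s in states:
    z = adapt_embedding(s, z)
    history.append(z.copy())
```

The point of the sketch is only the control flow: the inferred embedding tracks a state-dependent optimum as the rollout proceeds, which is what distinguishes deployment adaptation from one-shot inference at rollout start.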



2. PRELIMINARIES

2.1 REINFORCEMENT LEARNING AND OFFLINE REINFORCEMENT LEARNING

We model the interaction between agent and environment as a Markov Decision Process (MDP) (Sutton & Barto, 2018), denoted by the tuple (S, A, R, P, µ), where S is the state space, A is the action space, R is the reward function, P is the transition dynamics, and µ is the initial state distribution.

Footnote 1: In what follows, we use A1, A2, and A3 to denote our answers to the raised questions (Q1, Q2, and Q3), respectively.

Footnote 2: Please note that this MBO is different from regular model-based RL (MBRL for short): the "model" in MBO denotes a score model, while that in MBRL denotes the transition-dynamics (or reward) model.
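The distinction drawn in the footnote can be summarized in two toy signatures (illustrative only, not from the paper): an MBO score model maps a design input to a scalar score, whereas an MBRL model maps a state-action pair to a next state.

```python
import numpy as np

# Illustrative contrast only: the "model" in offline MBO scores a design
# input, while the "model" in MBRL predicts the next state.
def mbo_score_model(z):
    """Score model: design input/embedding z -> scalar score (toy)."""
    return float(-np.sum(np.asarray(z) ** 2))

def mbrl_dynamics_model(s, a):
    """Dynamics model: (state, action) -> next state (toy linear dynamics)."""
    return 0.9 * s + 0.1 * a
```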



Figure 1: A framework for bi-level offline RL optimization, where the inner-level optimization refers to regularizing the policy/value function (for OOD issues) and the outer-level refers to updating the policy (for reward maximization). Non-iterative offline RL decouples the joint optimization (of the two levels) from the training phase, where the " " transferred from the inner level to the outer level depends on the specific choice of algorithm. In Table 1, we summarize the different choices for " ".

