DROP: CONSERVATIVE MODEL-BASED OPTIMIZATION FOR OFFLINE REINFORCEMENT LEARNING

Anonymous

Abstract

In this work, we decouple iterative (bi-level) offline RL from the offline training phase, forming a non-iterative bi-level paradigm that avoids iterative error propagation across the two levels. Specifically, this non-iterative paradigm allows us to conduct the inner-level optimization during training (i.e., employing policy/value regularization), while performing the outer-level optimization during testing (i.e., conducting policy inference). Naturally, such a paradigm raises three core questions (that are not fully answered by prior non-iterative offline RL counterparts such as reward-conditioned policies): (Q1) What information should we transfer from the inner level to the outer level? (Q2) What should we pay attention to when exploiting the transferred information for the outer-level optimization? (Q3) What are the benefits of concurrently conducting outer-level optimization during testing? Motivated by model-based optimization (MBO), we propose DROP, which fully answers the above three questions. In particular, at the inner level, DROP decomposes the offline data into multiple subsets and learns an MBO score model (A1). To keep exploitation of the score model safe at the outer level, we explicitly learn a behavior embedding and introduce a conservative regularization (A2). During testing, we show that DROP permits deployment adaptation, enabling adaptive inference across states (A3). Empirically, we evaluate DROP on various tasks and show that it achieves performance comparable to or better than prior methods.

1. INTRODUCTION

Offline reinforcement learning (RL) (Lange et al., 2012; Levine et al., 2020) describes the task of learning a policy from previously collected static data. Because values at out-of-distribution (OOD) state-actions tend to be overestimated, recent iterative offline RL methods introduce various policy/value regularizers to avoid deviating from the offline data distribution (or support) during the training phase. These methods then deploy the learned policy directly in an online environment to test its performance. To set up the following analysis, we term this learning procedure iterative bi-level offline RL (Figure 1, left): the inner-level optimization tries to eliminate the OOD issue by constraining the policy/value function, while the outer-level optimization tries to learn a better policy to be deployed at test time. Here, we use the term "iterative" to emphasize that the inner and outer levels are optimized iteratively during the training phase. However, without sufficient inner-level optimization (OOD regularization), a distribution shift remains between the behavior policy and the policy to be evaluated. Moreover, due to iterative error exploitation and propagation (Brandfonbrener et al., 2021) across the two levels, carrying out this iterative bi-level optimization entirely during training often fails to produce a stable policy/value function.

In this work, we thus advocate non-iterative bi-level optimization (Figure 1, right), which decouples the bi-level optimization from the training phase: the inner-level optimization (eliminating OOD issues) is performed during training, and the outer-level optimization (updating the policy) is performed during testing. Intuitively, moving the outer-level optimization into the testing phase eliminates the iterative error propagation between the two levels. This raises three core questions¹: (Q1) What information should we transfer from the inner level to the outer level? (Q2) What should we pay special attention to when exploiting the transferred information for the outer-level optimization? (Q3) Given that the outer-level optimization and the online rollout at test time form a new loop, what new benefits does this offer?
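As a purely illustrative toy (none of the names or design choices below come from the paper), the decoupling can be pictured as follows: the inner level fits a score model over behavior embeddings extracted from offline data, and the outer level searches that model at test time under a conservative penalty that keeps queries near the data support.

```python
import numpy as np

# Illustrative sketch, NOT the paper's implementation: the kernel regressor,
# distance penalty, and hill climbing are all hypothetical stand-ins.
rng = np.random.default_rng(0)

# Offline data: behavior "embeddings" Z with observed returns (scores).
Z = rng.normal(0.0, 1.0, size=(64, 2))
scores = -np.sum((Z - 0.5) ** 2, axis=1)  # toy ground-truth score surface

def score_model(z, bandwidth=0.5):
    """Inner level (training): a kernel regressor standing in for a learned
    MBO score model fit on the offline subsets."""
    w = np.exp(-np.sum((Z - z) ** 2, axis=1) / (2 * bandwidth ** 2))
    return float(np.sum(w * scores) / (np.sum(w) + 1e-8))

def conservative_score(z, alpha=1.0):
    """Score minus a penalty on the distance to the nearest behavior
    embedding, discouraging exploitation of the score model off-support."""
    dist = float(np.min(np.linalg.norm(Z - z, axis=1)))
    return score_model(z) - alpha * dist

# Outer level (testing): simple hill climbing over z -- a stand-in for
# test-time policy inference conditioned on the selected embedding.
z = np.zeros(2)
best = conservative_score(z)
for _ in range(300):
    cand = z + 0.1 * rng.normal(size=2)
    s = conservative_score(cand)
    if s > best:
        z, best = cand, s
```

Because the outer-level search happens only at test time, the inner-level model is never trained against its own exploitation errors, which is the intuition behind avoiding iterative error propagation.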



¹Next, we will use A1, A2, and A3 to denote our answers to the raised questions (Q1, Q2, and Q3), respectively.

