ON THE IMPORTANCE OF THE POLICY STRUCTURE IN OFFLINE REINFORCEMENT LEARNING

Anonymous authors
Paper under double-blind review

Abstract

Offline reinforcement learning (RL) has recently attracted a great deal of attention as an approach to leveraging past experience to learn a policy. Recent studies have highlighted challenges of offline RL, such as erroneously estimating the values of actions that are out of the data distribution. To mitigate these issues, we propose an algorithm that leverages a mixture of deterministic policies. In our framework, the state-action space is divided by learning discrete latent variables, and sub-policies corresponding to each region are trained. The proposed algorithm, which we call Value-Weighted Variational Auto-Encoder (V2AE), is derived by considering the variational lower bound of the offline RL objective function. The aim of this work is to shed light on the importance of the policy structure in offline RL. We show empirically that the use of the proposed mixture policy can reduce the accumulation of the critic loss in offline RL, which was reported in previous studies. Experimental results also indicate that introducing the policy structure improves performance on tasks from the D4RL benchmark datasets.

1. INTRODUCTION

Reinforcement learning (RL) (Sutton & Barto, 2018) has had remarkable success in a variety of applications. Many of these successes have been achieved in online learning settings, where the RL agent interacts with the environment during the learning process. However, such interactions are often time-consuming and computationally expensive. The desirability of reducing the number of interactions in RL has motivated an active interest in offline RL (Levine et al., 2020), also known as batch RL (Lange et al., 2012). In offline RL, the goal is to learn the optimal policy from a prepared dataset collected through an arbitrary and unknown process. Prior work on offline RL has focused on how to avoid estimating the Q-values of actions that are out of the data distribution (Fujimoto et al., 2019; Fujimoto & Gu, 2021). While previous studies often address this issue through the regularization of critics (Kumar et al., 2020; An et al., 2021; Kostrikov et al., 2021; 2022), we propose to mitigate it from the perspective of the policy structure. Our hypothesis is that the evaluation of out-of-distribution actions can be avoided by dividing the state-action space, which is potentially achieved by learning discrete latent variables of the state-action space. When the data distribution is multimodal, as shown in Figure 1(a), fitting a policy modeled with a unimodal distribution such as a Gaussian may lead to interpolation between separate modes, which results in the value estimation of actions that are out of the data distribution (Figure 1(b)). To avoid this, we employ a mixture of deterministic policies (Figure 1(c)): we divide the state-action space and learn a sub-policy for each region. Ideally, this approach enables us to avoid interpolating between separate modes of the data distribution.
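As a concrete illustration (not the paper's implementation), a mixture of deterministic policies can be sketched as a set of sub-policies indexed by a discrete latent variable z, together with a gating function that selects z from the state. All names, dimensions, and the hand-coded gating rule below are hypothetical; in the proposed approach, z is learned rather than hand-designed.

```python
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM, ACTION_DIM, N_SUB = 3, 2, 4  # hypothetical sizes

# One deterministic linear sub-policy a = W[z] @ s per discrete latent value z.
W = rng.normal(size=(N_SUB, ACTION_DIM, STATE_DIM))

def gating(state):
    # Toy gating rule: map the state's sign pattern to a sub-policy index.
    # In the proposed approach this assignment is a learned discrete
    # latent variable of the state-action space, not a hand-coded rule.
    return int(np.sum(state > 0)) % N_SUB

def mixture_policy(state):
    # Exactly one deterministic sub-policy is activated per state, so
    # separate modes of the data distribution are never interpolated.
    z = gating(state)
    return W[z] @ state
```

Because each state activates a single deterministic sub-policy, the mixture never averages between modes the way a single unimodal (e.g. Gaussian) policy would.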
In this study, we propose to train a mixture policy by learning discrete latent representations, which can be interpreted as dividing the state-action space and learning sub-policies that correspond to each region. We derive the proposed algorithm by considering the variational lower bound of the offline RL objective function. We refer to the proposed algorithm as Value-Weighted Variational Auto-Encoder (V2AE). The main contribution of this study is an offline RL algorithm that trains a mixture policy by learning discrete latent variables. We also propose a regularization technique for the mixture policy based on mutual information, and empirically show that this regularization improves the performance of the proposed algorithm. A previous study (Brandfonbrener et al., 2021) reported the accumulation of critic loss values during the training phase, which was attributed to the generation of out-of-distribution actions. We show empirically that the use of the proposed mixture policy can reduce the accumulation of this approximation error in offline RL. In experiments with benchmark tasks from D4RL (Fu et al., 2020), the proposed algorithm proves to be competitive with popular offline RL methods. While the experimental results show promising performance, our aim is to shed light on the importance of the policy structure as an inductive bias in offline RL, rather than to claim state-of-the-art performance.

Figure 1: Schematic illustration of the proposed approach. (a) In offline RL, the distribution of samples is often multimodal; (b) fitting a unimodal distribution to such samples can lead to estimating actions out of the data distribution; (c) in the proposed approach, a discrete latent variable of the state-action space is learned, and a deterministic policy is learned for each region.
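This excerpt does not spell out the V2AE objective, but one simple way to picture a "value-weighted" reconstruction objective is to weight per-sample reconstruction errors by an exponentiated critic value, so that the fit is dominated by high-Q-value actions in the dataset. The function below is a hypothetical sketch of that idea, not the paper's derived lower bound; the function name, the exp(Q/β) weighting, and the batch normalization are all assumptions.

```python
import numpy as np

def value_weighted_loss(recon_errors, q_values, beta=1.0):
    # Hypothetical value-weighted objective: weight each sample's
    # reconstruction error by exp(Q / beta), normalized over the batch,
    # so actions with high estimated value dominate the fit.
    w = np.exp(np.asarray(q_values, dtype=float) / beta)
    w = w / w.sum()
    return float(np.sum(w * np.asarray(recon_errors, dtype=float)))
```

With uniform Q-values this reduces to the mean reconstruction error; as β shrinks, the loss concentrates on the highest-value samples.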

2. RELATED WORK

Recent studies have shown that regularization is a crucial component of offline RL (Fujimoto et al., 2019; Kumar et al., 2020; Levine et al., 2020; Kostrikov et al., 2021). For example, Kostrikov et al. (2021) proposed a regularization based on the Fisher divergence, and Fujimoto & Gu (2021) showed that simply adding a behavior cloning term to the objective function of TD3 can achieve state-of-the-art performance on D4RL benchmark tasks (Fu et al., 2020). Other research has investigated the structure of the critic, proposing the use of an ensemble of critics (An et al., 2021) or a one-step offline RL approach (Brandfonbrener et al., 2021; Goo & Niekum, 2021). Previous studies (Fujimoto et al., 2019; Fujimoto & Gu, 2021) have indicated that the source of the value approximation error is the "extrapolation error" that arises when estimating the value of state-action pairs not contained in the given dataset. Our hypothesis is that such extrapolation error can be mitigated by dividing the state-action space, which is potentially achieved by learning discrete latent variables. We investigate the effect of incorporating the policy structure as an inductive bias in offline RL, which has not been fully explored. Learning a discrete latent variable in the context of RL is closely related to a mixture policy, where a policy is represented as a combination of a finite number of sub-policies. In a mixture policy, one of the sub-policies is activated for a given state, and the module that determines which sub-policy to use is often called the gating policy (Daniel et al., 2016). Because of this two-layered structure, a mixture policy is also called a hierarchical policy (Daniel et al., 2016). Although we do not consider temporal abstraction in this study, we note that a well-known hierarchical RL framework with temporal abstraction is the option-critic (Bacon et al., 2017).
Since we consider policies without temporal abstraction, we use the term "mixture policy," following the terminology of Wulfmeier et al. (2021). Previous studies have demonstrated the advantages of mixture policies in online RL (Osa et al., 2019; Zhang & Whiteson, 2019; Wulfmeier et al., 2020; 2021; Akrour et al., 2021). In these existing methods, sub-policies are often trained to cover separate modes of the Q-function, which is similar to our idea. While existing methods have leveraged latent variables in offline RL (Zhou et al., 2020; Chen et al., 2021b; 2022), the latent variable in those methods is continuous. As indicated by studies on latent representations (Kingma & Welling, 2014; Dupont, 2018; Brown et al., 2020), we argue that the use of discrete latent variables merits investigation in offline RL.
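A common gating rule in Q-based mixture-policy methods is to activate the sub-policy whose action the critic values most. The following sketch uses toy sub-policies and a toy critic (all names and definitions hypothetical) purely to show the mechanics; the learned gating in V2AE need not take this greedy form.

```python
import numpy as np

def select_subpolicy(state, sub_policies, q_fn):
    # Greedy gating: evaluate each deterministic sub-policy's action
    # under the critic and activate the sub-policy with the highest Q.
    actions = [pi(state) for pi in sub_policies]
    q_vals = [q_fn(state, a) for a in actions]
    k = int(np.argmax(q_vals))
    return k, actions[k]

# Toy sub-policies covering two separate action modes.
sub_policies = [
    lambda s: -np.ones_like(s),  # mode near a = -1
    lambda s: np.ones_like(s),   # mode near a = +1
]

def q_fn(state, action):
    # Toy critic peaked at the a = +1 mode.
    return -float(np.sum((action - 1.0) ** 2))
```

Because each sub-policy is deterministic and covers one mode, the gating step never produces an action between modes, which is exactly the interpolation the mixture structure is meant to avoid.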

3. PROBLEM FORMULATION

Reinforcement Learning Consider a reinforcement learning problem under a Markov decision process (MDP) defined by a tuple (S, A, P, r, γ, d), where S is the state space, A is the action space, P(s_{t+1}|s_t, a_t) is the transition probability density, r(s, a) is the reward function, γ is the discount

