MODEL-BASED VALUE EXPLORATION IN ACTOR-CRITIC DEEP REINFORCEMENT LEARNING

Abstract

Off-policy methods have demonstrated great potential in model-free deep reinforcement learning owing to their sample efficiency. However, they suffer additional instability caused by mismatches between the distributions of stored observations and those induced by the current policy. Model-free on-policy counterparts usually have poor sample efficiency. Model-based algorithms, in contrast, depend heavily on the quality of expert demonstrations or of the learned dynamics. In this work, we propose a method that trains a dynamics model to accelerate and gradually stabilize learning without adding sample complexity. The prediction of the dynamics model provides effective target value exploration, which differs fundamentally from on-policy exploration methods, by adding valid diversity to the transitions. Despite the existence of model bias, the model-based prediction can avoid the overestimation and distribution-mismatch errors of off-policy learning, as the learned dynamics model is asymptotically accurate. Moreover, to generalize the solution to large-scale reinforcement learning problems, we use a global Gaussian model and a deterministic function approximator to represent the transition probability and the reward function, respectively. To minimize the negative impact of potential model bias introduced by the estimated dynamics, we adopt a one-step global prediction for the model-based part of the target value. Through analyses and proofs, we show how the model-based prediction provides value exploration and asymptotic performance to the overall network. We also conclude that the convergence of the proposed algorithm depends only on the accuracy of the learned dynamics model.
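The one-step model-based target value described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the functions `mu`, `r_hat`, `q_target`, and `pi` are hypothetical stand-ins for the trained networks (the Gaussian dynamics mean, the deterministic reward model, the target critic, and the policy), and the toy linear forms and noise scale are assumptions made purely for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for trained networks (illustrative only):
def mu(s, a):            # mean of the global Gaussian dynamics model
    return 0.9 * s + 0.1 * a

def r_hat(s, a):         # deterministic learned reward model
    return -np.sum(s**2) - 0.01 * np.sum(a**2)

def q_target(s, a):      # target critic
    return -np.sum(s**2)

def pi(s):               # deterministic policy
    return -0.5 * s

def model_based_target(s, a, gamma=0.99, sigma=0.05):
    """One-step model-based target value: sample the next state from the
    global Gaussian dynamics N(mu(s, a), sigma^2 I), then bootstrap with
    the target critic evaluated at the policy's action."""
    s_next = mu(s, a) + sigma * rng.standard_normal(s.shape)
    return r_hat(s, a) + gamma * q_target(s_next, pi(s_next))

s = np.array([0.5, -0.2])
a = pi(s)
y = model_based_target(s, a)
```

Because the rollout is only one step, errors in the learned dynamics enter the target once rather than compounding over a long imagined trajectory, which is the rationale for the one-step global prediction.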

1. INTRODUCTION

Model-free reinforcement learning (RL) algorithms have been applied to a wide range of tasks, ranging from simple games (Mnih et al., 2013; Oh et al., 2016) to robotic locomotion skills (Schulman et al., 2015). To tackle large-scale continuous control problems, deep reinforcement learning (DRL) uses neural-network function approximators to represent high-dimensional state and action spaces. However, model-free DRL is notoriously expensive in terms of sample efficiency, which makes it difficult to deploy in real-world settings where samples are costly to obtain. Among recent model-free DRL algorithms, on-policy methods (Schulman et al., 2015; 2017; Fujimoto et al., 2018) typically require multiple samples to be collected for each rollout at every gradient step, which is wasteful because the multiplied data requirement does not necessarily bring a corresponding performance gain. In comparison, off-policy methods aim to reuse past experience by storing collected observations in a memory buffer, typically combining Q-learning with neural networks (Mnih et al., 2015). Unfortunately, the combination of off-policy learning and high-dimensional, nonlinear function approximation is exposed to instability and divergence issues (Maei et al., 2009). The causes of these problems are complicated; for example, some works (Fujimoto et al., 2018; 2019; Duan et al., 2021) attribute them to overestimation bias, whereby the value estimate, which is continually maximized during actor-critic optimization, accumulates overestimation errors and breaks training stability. Others point to extrapolation error induced by the mismatch between the distribution of data sampled from the experience buffer and the true state-action visitation of the current policy (Fujimoto et al., 2019). There have been several ways to tackle this distribution mismatch.
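The overestimation effect mentioned above can be seen in a small numerical experiment (an illustration of the general phenomenon, with arbitrary toy numbers, not a reproduction of any cited work): even when each per-action value estimate is unbiased, taking the maximum over noisy estimates is biased upward, and bootstrapped targets propagate this bias.

```python
import numpy as np

rng = np.random.default_rng(1)

# True action values for a single state; all equal 1.0, so the true max is 1.0.
true_q = np.ones(10)

# Noisy critic estimates: each estimate has zero-mean noise, so every
# individual estimate is unbiased. The max over them is not.
estimates = true_q + 0.3 * rng.standard_normal((100_000, 10))

# Average of max-over-actions across many independent estimate draws.
empirical_max = estimates.max(axis=1).mean()
# empirical_max exceeds the true maximum of 1.0 even though each
# per-action estimator is unbiased; bootstrapping on such maxima
# accumulates the error across updates.
```

With 10 actions and noise of standard deviation 0.3, `empirical_max` lands well above 1.0, which is the bias that clipped double-Q style methods (Fujimoto et al., 2018) are designed to counteract.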
The authors of Wu et al. (2019) address the distribution errors with an extra value penalty or policy regularization, while Wang & Ross (2019) change the rule of experience replay to reduce the

