MODEL-BASED VALUE EXPLORATION IN ACTOR-CRITIC DEEP REINFORCEMENT LEARNING

Abstract

Off-policy methods have demonstrated great potential in model-free deep reinforcement learning due to their sample efficiency. However, they suffer additional instability caused by mismatched distributions in the stored observations. Model-free on-policy counterparts usually have poor sample efficiency, while model-based algorithms depend heavily on the quality of expert demonstrations or the learned dynamics. In this work, we propose a method that trains a dynamics model to accelerate and gradually stabilize learning without adding sample complexity. The dynamics model's prediction provides effective target-value exploration, which is fundamentally different from on-policy exploration methods, by adding valid diversity to the transitions. Despite the existence of model bias, the model-based prediction can avoid the overestimation and distribution-mismatch errors of off-policy learning, as the learned dynamics model is asymptotically accurate. Moreover, to generalize the solution to large-scale reinforcement learning problems, we use global Gaussian and deterministic function approximation to model the transition probability and the reward function, respectively. To minimize the negative impact of potential model bias introduced by the estimated dynamics, we adopt one-step global prediction for the model-based part of the target value. Through analysis and proofs, we show how the model-based prediction provides value exploration and asymptotic performance to the overall network. We also conclude that the convergence of the proposed algorithm depends only on the accuracy of the learned dynamics model.

1. INTRODUCTION

Model-free reinforcement learning (RL) algorithms have been applied to a wide range of tasks, ranging from simple games (Mnih et al., 2013; Oh et al., 2016) to robotic locomotion skills (Schulman et al., 2015). To tackle large-scale continuous control problems, deep reinforcement learning (DRL) uses neural-network function approximators to represent high-dimensional state and action spaces. However, model-free DRL is notoriously expensive in terms of sample efficiency, which makes it difficult to deploy in real-world settings where samples are costly to obtain. Among recent model-free DRL algorithms, on-policy methods (Schulman et al., 2015; 2017) typically require multiple samples to be collected for each rollout at every gradient step, which is wasteful because the multiplied data requirement does not necessarily bring a corresponding performance gain. In comparison, off-policy methods aim to reuse past experience by storing collected observations in a replay buffer, typically combining Q-learning with neural networks (Mnih et al., 2015). Unfortunately, the combination of off-policy learning and high-dimensional, nonlinear function approximation is exposed to issues of instability and divergence (Maei et al., 2009). The causes of these problems are complicated. Some works (Fujimoto et al., 2018; 2019; Duan et al., 2021) blame overestimation bias, whereby the continually maximized value during actor-critic optimization accumulates overestimation errors and breaks training stability. Others point to extrapolation error induced by the mismatch between the distribution of data sampled from experience and the true state-action visitation of the current policy (Fujimoto et al., 2019). There have been several ways to tackle this distribution mismatch.
The authors in (Wu et al., 2019) address the distribution errors with an extra value penalty or policy regularization; (Wang & Ross, 2019) changes the experience-replay rule to reduce the distribution mismatch by sampling more aggressively from recent experience while ordering the updates so that updates from old data do not overwrite updates from new data; and (Martin et al., 2021) relabels successful episodes as expert demonstrations for the agent to match. Despite these efforts, the overestimation bias and the mismatched distribution from past experience can only be mitigated, and the remedies sometimes induce new problems.

This paper makes the following contributions. First, instead of using immediate rewards or assuming a known reward function, we adopt neural networks to approximate the reward function as part of the dynamics, and we train the parameters of the modeled transition probability and reward function on the replay buffer of off-policy observations. Second, the prediction from the learned dynamics is blended into the target value at a certain ratio. Since the dynamics prediction is fundamentally different from observations gathered from the environment, it provides extra exploration that is not conditioned on the state-action visitation history. Moreover, a well-trained dynamics model is free of overestimation and distribution-mismatch errors, and can therefore provide a more accurate target value and stabilize the asymptotic performance. Third, we propose the corresponding algorithm, and the experimental results demonstrate its efficiency and stability. Fourth, the accuracy of the learned model is tested by setting a maximum online time step, after which off-line planning proceeds in isolation from the environment.
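As a rough illustration of the second contribution, blending a one-step model-based prediction into the off-policy target value can be sketched as follows. The function and parameter names (`blended_target`, mixing ratio `w`) and the exact blending rule are our illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def blended_target(s, a, s_next_env, r_env, model, q_fn, gamma=0.99, w=0.2):
    """Mix the off-policy TD target with a one-step model-based target.

    `model(s, a)` stands for a learned dynamics model returning a predicted
    next state (e.g. the mean of a Gaussian transition model) and a
    deterministic reward estimate. `w` is the assumed mixing ratio for the
    model-based part of the target.
    """
    # Standard off-policy target from the observed transition.
    target_env = r_env + gamma * q_fn(s_next_env)
    # Model-based target: predict (s', r) with the learned dynamics and
    # bootstrap from the predicted next state. Only one step is taken,
    # to limit the impact of model bias.
    s_next_pred, r_pred = model(s, a)
    target_model = r_pred + gamma * q_fn(s_next_pred)
    return (1.0 - w) * target_env + w * target_model
```

Because the model-based term is not drawn from the replay buffer, it injects target-value diversity that is independent of the state-action visitation history, which is the exploration effect described above.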

2. RELATED WORK

Due to the various problems arising from the sample complexity of model-free algorithms, task-specific representations (Peters et al., 2010; Deisenroth et al., 2013) as well as model-based algorithms (Deisenroth & Rasmussen, 2011; Levine et al., 2016; 2018; Kaiser et al., 2019) using planning, which optimize the policy under a learned or given dynamics model, are often preferred in real physical systems such as robots and autonomous vehicles. However, task-specific representations cover a limited range of learnable tasks and require more domain knowledge. Model-based DRL algorithms are considered more sample-efficient (Deisenroth et al., 2013), because they construct a probabilistic dynamics model from data and avoid interaction with the environment by training the policy on the learned model (Hua et al., 2021), but this limits the policy to being only as good as the learned model (Gu et al., 2016). For the model-free part, the agent needs to interact with the environment to collect enough knowledge for training, which highlights the importance of the tradeoff between exploration and exploitation (Mnih et al., 2016). Soft actor-critic (SAC) (Haarnoja et al., 2018a; b) achieves good performance on a set of continuous control tasks by adopting stochastic function approximation and maximum entropy for policy exploration. Among these techniques, stochastic policies have the advantage over deterministic counterparts of allowing both on-policy exploration and off-policy experience replay (Heess et al., 2015), and maximum-entropy exploration improves robustness and stability (Ziebart et al., 2008; Ziebart, 2010). Overall, the existing exploration strategies are limited to the policy, which raises the question of whether and how value exploration can play a positive role in model-free learning.
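For reference, the maximum-entropy exploration used by SAC can be illustrated in the discrete-action case, where the soft value under the Boltzmann policy reduces to a temperature-scaled log-sum-exp of the Q-values. The function name and temperature `alpha` below are illustrative, not taken from the SAC papers:

```python
import numpy as np

def soft_value(q_values, alpha=0.2):
    """Soft value of a state under the maximum-entropy optimal policy.

    For discrete actions, the entropy-regularized optimal policy is
    pi(a) ∝ exp(Q(a) / alpha), and the corresponding soft value
    V = E_pi[Q(a) - alpha * log pi(a)] = alpha * log sum_a exp(Q(a) / alpha),
    i.e. a smooth maximum that keeps probability mass on near-optimal actions.
    """
    q = np.asarray(q_values, dtype=float)
    return alpha * np.log(np.sum(np.exp(q / alpha)))
```

As `alpha` shrinks, `soft_value` approaches the hard maximum of the Q-values; larger `alpha` rewards policies that remain stochastic, which is the source of the robustness noted above.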
While some works combine model-free and model-based DRL in the literature (Sutton, 1990; Lampe & Riedmiller, 2014), the following are particularly relevant to this paper. Specifically, (Gu et al., 2016; Nagabandi et al., 2018) add synthetic imagination rollouts to an additional replay buffer for model-guided exploration in off-policy methods, at the price of much higher storage and computation costs. Model ensembles are adopted in (Chua et al., 2018; Kurutach et al., 2018; Janner et al., 2019) to reduce the misguided policies or inaccurate planning caused by model bias. Moreover, value expansion with fixed multi-step prediction by a dynamics model is adopted in (Feinberg et al., 2018; Buckman et al., 2018) to expand the value properly and control the imagination depth. However, multi-step prediction from a global dynamics model may suffer cumulative model-estimation errors, and is therefore often replaced by iteratively refitted time-varying linear models (Levine & Abbeel, 2014). VIME (Houthooft et al., 2016) maximizes the information gain about the dynamics' uncertainty, but relies heavily on theoretical analysis and offers limited intuitive interpretation. In this paper, we adopt one-step prediction from a global dynamics model for value exploration, which avoids the storage and computation costs of multi-step synthetic rollouts while achieving diversity, accuracy, and generality.

