TRUST, BUT VERIFY: MODEL-BASED EXPLORATION IN SPARSE REWARD ENVIRONMENTS

Abstract

We propose the trust-but-verify (TBV) mechanism, a new method which uses model uncertainty estimates to guide exploration. The mechanism augments graph search planning algorithms with the capacity to deal with a learned model's imperfections. We identify a frequent type of model error, which we dub false loops, that is particularly dangerous for graph search algorithms in discrete environments. These errors impose falsely pessimistic expectations and thus hinder exploration. We confirm this experimentally and show that TBV can effectively alleviate them. TBV combined with MCTS or Best First Search forms an effective model-based reinforcement learning solution, which is able to robustly solve sparse reward problems.

1. INTRODUCTION

The model-based approach to Reinforcement Learning (RL) brings a promise of data efficiency and, with it, much greater generality. However, how to make model-based RL algorithms robust remains largely an open question. In most cases, current solutions excel in the low-sample regime but underperform asymptotically; see Wang et al. (2019); Nagabandi et al. (2018a); Kaiser et al. (2020). The principal issues are the imperfections of the learned model and fragile planners unable to deal robustly with these imperfections; see (Sutton & Barto, 2018, Section 8.3), (François-Lavet et al., 2018, Section 6.2), (Wang et al., 2019, Section 4.5). Model errors are unavoidable in any realistic RL scenario and thus need to be taken into account, particularly when planning is involved. They can be classified into two categories: optimistic and pessimistic, see (Sutton & Barto, 2018, Section 8.3). The former is rather benign or in some cases even beneficial, as it can boost exploration by sending the agent into falsely attractive areas. It has a self-correcting mechanism, since the newly collected data will improve the model. In contrast, when the model has a pessimistic view of a state, the agent might never have the incentive to visit it and, consequently, to make the appropriate adjustments. In this work, we propose the trust-but-verify (TBV) mechanism, a new method to augment planners with the capacity to prioritise visits to states for which the model is suspected to be pessimistic. The method is based on uncertainty estimates, is agnostic to the planner choice, and is rooted in the statistical hypothesis testing framework. Taking advantage of the graph structure of the underlying problem can be beneficial during planning, but it also makes planning more vulnerable to model errors. We argue that graph search planners might benefit most from utilising TBV.
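The switching rule at the heart of such a mechanism can be illustrated with a minimal sketch (our own illustration; the function and variable names are hypothetical and not taken from the paper): the ensemble's disagreement on a predicted transition is compared, via a one-sided test, against disagreement scores recorded on transitions that were already verified against the real environment.

```python
import numpy as np

def tbv_should_verify(ensemble_preds, calibration_scores, alpha=0.05):
    """Decide whether to 'verify' (explore) rather than trust the model.

    ensemble_preds: (K, D) array of K ensemble members' predicted next-state
        features for one state-action pair (hypothetical representation).
    calibration_scores: 1-D array of disagreement scores collected on
        transitions whose predictions were confirmed by real interaction.
    alpha: significance level of the one-sided test.
    """
    # Disagreement statistic: mean per-feature standard deviation
    # across ensemble members.
    disagreement = float(np.mean(np.std(ensemble_preds, axis=0)))
    # One-sided test: reject "the model is trustworthy here" when the
    # disagreement exceeds the (1 - alpha) quantile of calibration scores.
    threshold = float(np.quantile(calibration_scores, 1.0 - alpha))
    return disagreement > threshold
```

A planner could call such a predicate per node expansion: when it returns `True`, the action leading to the suspect state is prioritised so that real data can correct the model.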
While TBV can correct for any type of pessimistic error, we explicitly identify a type of model error which frequently occurs in discrete environments, which we dub false loops. In a nutshell, these are situations in which the model falsely equates two states. Such errors are similar to (Sutton & Barto, 2018, Example 8.3), in which the model erroneously imagines 'bouncing off' a non-existent wall and thus fails to exploit a shortcut. From a more general point of view, the problem is another incarnation of the exploration-exploitation dilemma. We highlight this by evaluating our method in sparse reward environments. In a planning context, exploration aims to collect data for improving the model, while exploitation aims to act optimally with respect to the current one. The problem might be solved by encouraging the agent to revisit state-action pairs with an incentive of 'bonus rewards' based on the state-action visitation frequency. Such a solution, (Sutton & Barto, 2018, Dyna-Q+), is restricted to the tabular case. TBV can be thought of as its scalable version.

Our contributions are:
• TBV - a general risk-aware mechanism for dealing with pessimistic model errors, rooted in the statistical hypothesis testing framework.
• Conceptual sources and types of model errors, accompanied by empirical evidence.
• A practical implementation of TBV with Best First Search, a classical graph search algorithm family, and Monte Carlo Tree Search, the state-of-the-art class of planners in many challenging domains.
• Empirical verification of TBV behaviour in two sparse reward domains: ToyMontezumaRevenge (Roderick et al., 2018) and the Tower of Hanoi puzzle.

The code of our work is available at: https://github.com/ComradeMisha/TrustButVerify.
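Why a false loop is so damaging to a graph search planner can be shown in a few lines (a toy sketch under our own assumptions; `false_loop_model` is a hypothetical learned model, not the paper's): once the model predicts that a state transitions only to already-seen states, the planner's frontier empties and the states behind the spurious loop are never planned for.

```python
def plan_reachable(model, start, actions, max_nodes=1000):
    """Enumerate the states a graph-search planner believes are reachable
    under a (possibly wrong) learned transition model."""
    seen, frontier = {start}, [start]
    while frontier and len(seen) < max_nodes:
        s = frontier.pop()
        for a in actions:
            s2 = model(s, a)
            if s2 not in seen:
                seen.add(s2)
                frontier.append(s2)
    return seen

# A hypothetical learned model with a "false loop": from state 1 it wrongly
# predicts that every action leads back to state 1, hiding states 2, 3, ...
# (in the true environment, action a from state s would lead to s + a).
def false_loop_model(s, a):
    return 1 if s == 1 else s + a
```

Running `plan_reachable(false_loop_model, 0, [0, 1])` yields only `{0, 1}`: the planner pessimistically concludes nothing lies beyond state 1 and has no incentive to go there and falsify the error, which is exactly the self-reinforcing failure mode described above.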

2. RELATED WORK

There is a huge body of work concerning exploration. Fundamental results in this area come from multi-armed bandit theory, including the celebrated UCB algorithm (Auer et al., 2002) and Thompson sampling (Thompson, 1933); see also Lattimore & Szepesvári (2020) for a thorough treatment of the subject. There are multiple variants of UCB, some of which are relevant to this work, including UCB-V (Audibert et al., 2007) and tree planning adaptations, such as UCT (Kocsis et al., 2006) and PUCT (Silver et al., 2017; 2018). Classical approaches to exploration in the reinforcement learning setting, including the principle of optimism in the face of uncertainty and ε-greedy exploration, can be found in Sutton & Barto (2018). Exploration in the form of entropy-based regularization can be found in A3C (Mnih et al., 2016) and SAC (Haarnoja et al., 2018). Plappert et al. (2017) and Fortunato et al. (2017) introduce noise in the parameter space of policies, which leads to state-dependent exploration. There are multiple approaches relying on exploration reward bonuses, which take into account: prediction error (Stadie et al., 2015; Schmidhuber, 2010; Pathak et al., 2019; Burda et al., 2018), visit counts (Bellemare et al., 2016; Ostrovski et al., 2017), temporal distance (Machado et al., 2020), classification (Fu et al., 2017), or information gain (Sun et al., 2011; Houthooft et al., 2016). An ensemble-based reinforcement learning counterpart of Thompson sampling can be found in Osband et al. (2016) and Osband et al. (2019). Exploration driven by experience-based goal generation can be found in Andrychowicz et al. (2017). A set of exploration methods has been developed in an attempt to solve the notoriously hard Montezuma's Revenge; see for example Ecoffet et al. (2019); Guo et al. (2019). In the context of model-based planning, Lowrey et al. (2019) and Milos et al. (2019) use value function ensembles and uncertainty-aware exploration bonuses (log-sum-exp and majority vote, respectively). More recently, Pathak et al. (2017), Shyam et al. (2019), Henaff (2019) and Sekar et al. (2020) dealt with exploration and model learning. These works are similar in spirit to ours: they train exploration policies intending to reduce the model error (e.g. by maximizing a disagreement measure). Our work differs from the above in the following ways. First, we aim to use powerful graph search techniques as planners while simultaneously learning the model. Our method addresses the fundamental issue of balancing intrinsic planner errors and those stemming from model imperfection.
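For reference, the classical UCB1 rule (Auer et al., 2002) underlying the tree-search variants mentioned in this section can be sketched as follows (a minimal illustration only; UCT and PUCT use modified bonus terms):

```python
import math

def ucb1_select(means, counts, total, c=math.sqrt(2)):
    """UCB1: pick the arm maximising the empirical mean plus an optimism
    bonus that shrinks as the arm's visit count grows."""
    scores = [m + c * math.sqrt(math.log(total) / n)
              for m, n in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)
```

The same "optimism in the face of uncertainty" shape recurs throughout the exploration literature surveyed above: a value estimate corrected by an uncertainty-dependent bonus.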
Second, we use the prediction error of the model ensemble to measure model uncertainty and apply a statistical hypothesis testing framework to switch between actions suggested by the model and actions leading to model improvement. Third, we use a discrete model and a discrete online planning regime. When a learned model is unrolled during planning, errors typically accumulate dramatically. A number of works address this problem. Racanière et al. (2017) and Guez et al. (2018) are two approaches that learn the planning mechanism and make it robust to model errors. In a somewhat similar spirit, Eysenbach et al. (2019) treats the replay buffer as a non-parametric model forming a graph and uses ensembles of learned distances as a risk-aware mechanism to avoid certain types of model errors. Nagabandi et al. (2018b) successfully blends the strengths of model-based and model-free approaches. PlaNet (Hafner et al., 2019) and Dreamer (Hafner et al., 2020) train latent models, which are used for planning. Conceptually, a similar route was explored in MuZero (Schrittwieser et al., 2019) and Universal Planning Networks (Srinivas et al., 2018). Farquhar et al. (2017) and Oh et al.

